INTRODUCTION

Field of the Invention: 

This invention relates to the area of bioinformatics, more specifically to methods for analyzing the 
sequences of evolutionarily related proteins, and most specifically for identifying evolutionary and 
functional relationships between proteins and the genes that encode them. 

Background 

Proteins are linear polypeptide chains composed of 20 different amino acid building blocks. 
Determining the sequence of amino acids in a protein is now experimentally routine, both by direct
chemical analysis of the proteins themselves, and by translation of genes that encode proteins. Genome 
sequencing projects provide those genes in abundance, and the full complement of genes is known for 
many microorganisms, yeast, the worm C. elegans, and the fly Drosophila melanogaster. The size of 
protein sequence databases will grow explosively over the next decade as more genome sequencing 
projects are completed. 

While genomic sequence data are widely believed to hold the key to a revolution in biology, and while
their impact on experimental work is already being felt, much of the revolution has still not materialized.
Missing in particular are bioinformatics tools that extract information about biological function starting
from genomic data and existing biological information, in a form that can be used by biomedical
researchers. This makes it difficult to assign a "utility" to a gene sequence, in turn making it difficult to
capture commercial value from gene sequence data. This issue is found throughout the commerce of
modern genomics, including in the patenting of genes. It is still not clear what standard must be applied to
establish that a gene patent application meets the "utility" requirement of the code, but information
concerning the biological function of the gene for which a patent is sought is almost certainly helpful.
Tools that generate testable hypotheses about the biological function of genes that encode proteins have
utility precisely because they enable genome sequences to be exploited commercially.

Once those tools are in place, it should be possible to mine genomic sequence data for information
about pathways [Mar99], generate insights into complex biological function (e.g. development) [Rub00],
and perhaps even identify pharmaceutical targets [Pol00]. Clearly, the payoff from a tool that could
provide biological information from sequence data could be enormous. Experiments are expensive;
genomic data are not (or, more precisely, their cost has already been amortized).

For this reason, functional bioinformatic tools need not generate "indisputable conclusions" for
them to have a broad utility. The payoff would be substantial if the tool did nothing more than suggest a
hypothesis about the function of a gene, or rule out some fraction of a protein family as relevant for a
function, both ways of targeting subsequent experiments. Modern biomedical research, especially that
directed towards the development of therapeutic agents, is largely a random walk in any case. Functional
bioinformatic tools would be valuable if they did nothing more than favorably bias that walk.

Generating hypotheses concerning possible biological functions that a gene might have is part of the
"annotation problem" in modern genomics. "Annotation", and (indeed) "function", mean very different
things to different people. Much of the literature regards "function" as equivalent to "behavior", what is
measured in the laboratory. Crystallographers occasionally view "function" as equivalent to "structure".
Throughout this proposal, we distinguish between "homology" (relationship by common ancestry),
"structure" (at two levels, as known to those skilled in the chemical arts: "constitution", meaning
"sequence", and "conformation", which is commonly referred to as "the fold" or, incorrectly, "the
structure" by structural biologists), "behavior" (what is measured in the laboratory), and "function". Under
Darwinian theory, "function" refers to adaptive behavior, properties that confer fitness, the ability of an
organism to survive and reproduce. The Darwinian paradigm holds that the only way to achieve function
is by random variation of genetic structure (mutation) followed by natural selection.

It is important to keep these distinctions clear. A statement concerning the function of a gene is
ultimately a statement about how that gene contributes to the fitness of the host organism. For this reason,
homologous proteins in different species generally do not have "the same function", as different species
have different requirements for fitness. They may, however, have "analogous" functions. As we disclose
below, even subtle differences in function between orthologous proteins in different organisms can be
very interesting, and can be the key to delivering a true understanding of "function" to biological and
biomedical research scientists. Therefore, tools that suggest (again, as a hypothesis) that the function of
two homologous genes might be different are frequently as useful as those that suggest that the two genes
have analogous functions.

In any case, the fact that Darwinian processes have generated the genes through natural selection
acting on the encoded proteins means that an evolutionary analysis must at some point be involved in any
annotation procedure; an evolutionary analysis is necessary to analyze for function. Further, as we shall
disclose, an evolutionary analysis is sufficient to generate hypotheses concerning function. As with any
hypothesis in science, functional hypotheses cannot be "proven". Rather, as with any hypothesis in
science, they form part of a web of conjecture, hypothesis, data, and interpretation that builds a useful
picture of function; utility comes long before "proof", in part because "proof" is not possible for any
substantive statement about the natural world.

The most common way in which evolutionary analysis is used today to annotate sequences involves 
pairwise sequence comparisons to detect homology between two proteins. The conventional recipe for 
inferring the "function" of a target open reading frame (the target sequence) using evolutionary analysis 
follows five steps: 

1. Use the target sequence as a probe in a BLAST [Alt90] or FASTA [Lip85] search of the Genbank 
database (or an equivalent). 

2. Identify "hits", proteins in the database whose sequence resembles that of the target. 

3. Evaluate the hits based on a statistical model.

4. Download the annotation of the statistically best "hits" that have functional annotation of their own. 

5. Infer that the function of the target protein is the same as the function of the best protein hit.
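
Steps 3 through 5 of this recipe can be sketched in a few lines of Python. The sketch is illustrative only: the `Hit` record, the accession strings, and the E-value cutoff of 1e-5 are hypothetical choices made for the example, not values taken from BLAST, FASTA, or this disclosure, and the search itself (steps 1 and 2) is assumed to have been run already.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One database hit for the target sequence (hypothetical record layout)."""
    accession: str
    e_value: float             # statistical score from the search; smaller is better
    annotation: Optional[str]  # functional annotation of the hit, if it has one


def transfer_annotation(hits: List[Hit], max_e_value: float = 1e-5) -> Optional[str]:
    """Steps 3-5: keep significant hits, pick the best annotated one, copy its annotation.

    Returns None when no hit is significant or no significant hit is annotated.
    The cutoff max_e_value is an assumed, illustrative threshold.
    """
    significant = [h for h in hits if h.e_value <= max_e_value]   # step 3
    annotated = [h for h in significant if h.annotation]          # step 4
    if not annotated:
        return None
    best = min(annotated, key=lambda h: h.e_value)
    return best.annotation                                        # step 5


# Hypothetical search results for one target sequence.
hits = [
    Hit("hypothetical_1", 1e-40, None),
    Hit("hypothetical_2", 1e-30, "fumarase"),
    Hit("hypothetical_3", 0.5, "aspartase"),
]
print(transfer_annotation(hits))  # fumarase
```

Note that the function simply copies the annotation of the best-scoring annotated hit. It returns `None` when nothing significant or nothing annotated is found, and it never questions whether the copied function actually applies to the target.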

Those annotating genomic sequences using this recipe recognize several obvious limitations to the
pairwise analysis approach to annotation. Commonly encountered problems include:

(a) The BLAST server returns no hits.

(b) The BLAST server returns sequences of possible homologs, but with similarity scores too low to be
certain that the sequence found is indeed a homolog.

(c) The homologous sequences that the server returns have no annotation indicating function.

The consequences of these problems are mentioned in most contemporary reviews of genomics.
Because of the limitations of this recipe, some 40% (depending on the details of the homology search) of
the proteins in a typical genomic sequence have not been reliably assigned any function. It is well
recognized that this arises because, as powerful as it is, BLAST cannot detect homologs after their
sequences have diverged far into the "twilight zone" of sequence similarity, defined (arbitrarily) by
Doolittle as less than 20% sequence identity [Doo87]. For this reason, tools that detect more distant
homology are being actively sought.
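
The 20% identity criterion can be made concrete with a short Python sketch. It assumes the two sequences have already been aligned to equal length with "-" marking gaps, and it excludes gapped positions from the count. That convention is a choice made for this example; other definitions of percent identity exist. The test sequences are toy strings, not real proteins.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of ungapped aligned positions at which two sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip positions where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * matches / compared


TWILIGHT_THRESHOLD = 20.0  # percent identity, the (arbitrary) cutoff cited in the text


def in_twilight_zone(a: str, b: str, threshold: float = TWILIGHT_THRESHOLD) -> bool:
    """True when the aligned pair falls below the identity threshold."""
    return percent_identity(a, b) < threshold


# Toy aligned sequences (hypothetical, for illustration only).
print(percent_identity("ACDEFGHIKL", "AWWWWWWWWW"))  # 10.0
print(in_twilight_zone("ACDEFGHIKL", "AWWWWWWWWW"))  # True
```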

To solve this "no interesting hit" problem, many workers have attempted to extend the power of the 
pairwise alignment tool to detect increasingly distant protein sequences. Pearson, for example, recently
developed statistical parameters from the distribution of similarity scores from thousands of unrelated 
sequences to gain an estimate of the statistical significance that can be used to infer sequence homology 
[Pea98]. BLAST has been improved by transformation into PSI-BLAST [Alt97]. 

Other approaches have been based on the fact that proteins diverging under functional constraint can 
retain their core folded structure long after sequence similarity has vanished [Ros75]. This makes possible 
the detection of distant homology by comparison of predicted structures. In Serial No. 07/857,224, filed
March 25, 1992, and issued as US Patent 5,958,784 on September 28, 1999, a parent of this application,
we disclosed the first useful tools to apply this approach. Many others have followed our lead, including a
particularly interesting study by Barton and his coworkers [Rus96], and we have recently published a 
review article showing various successes of this approach [Ben97]. 

Through their ability to detect distant homologs of proteins whose functions are known, the tools 
disclosed in Serial No. 07/857,224 have proven to be quite useful because they generate hypotheses about 
proteins with unknown function. For example, these tools were applied to the heat shock protein 90 
(HSP90) family, for which no member had an assigned function. A model for the conformation of the 
protein was built for HSP90 as part of the CASP2 prediction contest [Ger97]. The conformation model 
was recognizably similar to the fold for the N-terminal ATP binding domain of gyrase. This generated the
hypothesis that HSP90 and gyrase were distant homologs, and in turn the hypothesis that HSP90
bound ATP as it contributed to the fitness of its host organisms. This hypothesis contradicted
experimental papers available at the time, which claimed that HSP90 did not bind ATP [Jac96]. The prediction
was correct; the experimental papers were incorrect [Pro97]. This success was recognized both by the
CASP2 judges and the group that solved the crystal structure of HSP90, who wrote:

"The tertiary fold of Hsp90 N-domain has a remarkable and totally unexpected similarity to the N-terminal
ATP-binding fragment of ... DNA gyrase B protein. This similarity was not initially recognized by the authors
of either the human or yeast structures but was determined [by Gerloff and Benner] within the CASP2 structure
prediction competition. Our observation of specific ADP/ATP binding to Hsp90 completely contradicts the
careful and widely accepted biochemical analysis of Jakob et al. (1996) who demonstrated that Hsp90 could not
be photolabelled with 8-azidoATP, was not retained on C8 agarose, and did not enhance the fluorescence of
MABA-ADP." [Pro97]

A year later, the tools disclosed in Serial No. 07/857,224 were used to analyze functional and
structural relationships for ribonucleotide reductase [Tau97]. Other examples in our laboratory and
elsewhere are summarized in [Ben97]. These examples showed how very distant homolog detection had
utility in annotating gene sequences, simply by generating experimentally testable hypotheses concerning
their physiological function.

The pairwise alignment approach to annotation has become dominant in contemporary genomics.
Indeed, many patent applications for individual gene sequences are being submitted to the PTO where the 
sole argument for utility is based on the statement that the gene being patented is homologous to gene X, 
that homologous genes have analogous functions, and that this implies that the gene being patented and 
gene X have the same function. The logic here is outlined below: 

(a) Sequence similarity between two proteins implies that the two proteins are homologous (share 
common ancestry). 

(b) Homologous proteins have analogous conformations (folds). 

(c) Analogous folds imply analogous behaviors (what is measured in the laboratory). Thus,
homologous proteins bind analogous ligands, catalyze analogous reactions, and have analogous physical
properties.

(d) Homologous proteins have analogous function, that is, they contribute to survival in analogous
ways.

These assumptions constitute the "homology implies analogous function" (HIAF) logic. It is widely
regarded as the foundation for annotation (see, for example, Skolnick's contribution in this area [Fet98]).

Element (a) has sound statistical basis, at least within the context of a particular evolutionary theory. 
Element (b) is known from empirical analysis to be generally true, provided that the two proteins have 
diverged under functional constraints. Elements (c) and (d), however, are certainly not universally true,
and may not be true in general. Frequently, new function is generated in biological systems by 
recruitment of a protein that performs a different function. 

Already in 1988, well before the age of the genome, it was well known that the logic outlined above was
not reliable, because of frequent recruitment in the biological world. The Applicant himself reviewed
examples where conventional logic would provide deceptive annotation [Ben88]. Four examples illustrate
how severely elements (c) and (d) of the logic can be violated.

The first is chosen from eubacterial enzymology, and relates to three enzymes playing three distinct
roles in microbial metabolism: fumarase (in the citric acid cycle), aspartase (involved in amino acid
degradation), and adenylosuccinate lyase (essential for nucleic acid biosynthesis). The three proteins are
clearly recognizable as homologs. Their sequences share statistically significant similarity, as illustrated
by the following three excerpts:

LPEJSEKSSSIMPGKVNPTQC fumarase

LPELQAGSSIMPAKVNPVVP aspartase

FEKDQIGSSAMPYKRNPMEIS adenylosuccinate lyase 

The overall folded form of the three proteins is the same (an 8-fold alpha-beta barrel). They catalyze
reactions that, at least at a mechanistic level, have some degree of analogy. From a biologist's perspective,
however, they have very different functions. The HIAF logic used by virtually every genome annotation
tool would be deceived by this family.

The second example is from metazoan biology, and involves the family of proteins known as the src
homology 2 (SH2) domains. SH2 domains are clearly all homologous. The proteins all have analogous
folds and analogous behaviors; they all bind to a polypeptide that carries a phosphotyrosine. But they
bind different peptide sequences flanking the phosphotyrosine. For this reason, they have very different
functions, as the biologist defines it (and as it is defined throughout this proposal, the behavior that
confers survival value). Some SH2 domains are in viruses, and regulate viral growth. Some participate in
the immune response. Others are involved in the regulation of division of non-immune cells. For virtually
any practical purpose (pharmaceutical target identification, for example), the analogies between the
behaviors of different SH2 domains are less important than the differences in their function.

The third example used covarion behavior to detect changing function in elongation factors, proteins
whose sequences are so highly conserved that there is little difficulty in recognizing homologs, even in the
three kingdoms of life. If any protein "has the same function" in different species in orthologous form, it
is elongation factors, we reasoned. Even so, covarion behavior clearly showed that different (presumably)
orthologous forms of the protein had subtly different behaviors that indicate subtle differences in
function. Coupling the evolutionary insights based on a sophisticated, mathematically detailed,
evolutionary analysis to structural biology even identified residues that were important for these
differences.

The fourth compares protein serine kinases and protein tyrosine kinases. These families are clearly 
homologous, the latter having been recruited from the former ca. 600 million years ago. The chemist
would say that both classes of enzyme operate via analogous reaction mechanisms, differing only in the
source of the oxygen nucleophile in the phosphoryl transfer reaction. The biologist would note, however,
that the physiological functions of the two classes of proteins are greatly different. For any biomedical
application, the biologist would be correct. The physiologically relevant differences in behavior, central to
the understanding of biological function (phosphorylation on tyrosine versus phosphorylation on serine),
cannot be inferred for one family from the other using the conventional logic.

If proteins with recognizable (indeed, often high) sequence similarity can have different functions, the
a fortiori argument can be made that surely recruitment is possible for protein homologs with marginally
significant sequence similarities. This argument suggests that the focus of efforts, including those
disclosed in Serial No. 07/857,224, might add only modestly to our ability to provide reliable annotation.

In reality, both approaches are needed. This application focuses on tools to detect and/or rule out
recruitment within a protein family. We expect these tools to become increasingly important as the
genomes of metazoans are sequenced. It is now clear that the last 500 million years of molecular evolution
in higher organisms has involved repeated recruitment of existing folds to perform new functions.

SUMMARY OF THE INVENTION

As discussed in Serial No. 08/914,375, the parent for the instant application, the physiological
function of a biomolecule is ultimately determined by the contribution that the biomolecule makes to the
efforts of the host organism to survive, select a mate (in higher organisms), and reproduce. Determining
the physiological function of a protein is not trivial, however. Difficulties in establishing physiological
function are discussed at length by Benner and Ellington [Benner, S. A., Ellington, A. D. Interpreting the
behavior of enzymes. Purpose or pedigree? CRC Crit. Rev. Biochem. 23, 369-426 (1988)]. Still more
difficult is identifying which behaviors of a protein as measured in vitro are relevant for physiological
function in vivo. Nevertheless, the identification is important. In vitro behaviors that have relevance to
physiological function in vivo are those that are interesting to study for biotechnological, biomedical, or
other applications. There is at present in the art no general method for determining what in vitro behaviors
are relevant to in vivo function. Processes for determining these behaviors were claimed in the parent
application (Serial No. 08/914,375).

A method for making a model for the folded structure of a set of proteins from an evolutionary 
analysis of a set of aligned homologous protein sequences was claimed in Serial No. 07/857,224. The
instant application concerns methods for using these models. The first method is used to confirm or deny
a hypothesis that two proteins are homologous, and comprises comparing a predicted structure
model for one family of proteins with a predicted structure model for a second family of proteins, or an
experimental structure for the second family, and deducing the presence or absence of homology based on
the presence or absence of structural similarity flanking key residues in the polypeptide sequence. The
second method identifies mutations during the divergent evolution of a protein sequence that are
potentially adaptive by identifying episodes during the divergent evolution of a family of proteins where
there is a high absolute rate of amino acid substitution, or a high ratio of non-silent substitutions to
silent substitutions. Amino acid substitutions occurring during such an episode are likely to be adaptive.
The third is a method for identifying specific in vitro properties of the protein that are likely to play a
physiological role in vivo in an organism. This method involves synthesizing in the laboratory proteins
having the reconstructed amino acid sequences of a protein before and after a period of rapid sequence
evolution that characterizes adaptive substitution, measuring the in vitro properties of the protein before
the episode of rapid sequence evolution, and then measuring the in vitro properties of the protein after the
episode of rapid sequence evolution. The in vitro behaviors that remained unchanged through this episode
are not likely to have adaptive significance physiologically. The in vitro behaviors that changed through
this episode are likely to have adaptive significance physiologically. The fourth concerns a method for
organizing genome sized sequence databases.
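
The comparison at the heart of the third method can be illustrated with a minimal Python sketch. It assumes the in vitro behaviors of the two reconstructed proteins have already been measured and recorded as numbers. The behavior names, the values, and the 10% relative tolerance used to decide whether a behavior "changed" are all hypothetical and chosen only for the example.

```python
from typing import Dict, List, Tuple


def partition_behaviors(
    before: Dict[str, float],
    after: Dict[str, float],
    tolerance: float = 0.1,
) -> Tuple[List[str], List[str]]:
    """Split behaviors measured on both proteins into (changed, unchanged).

    A behavior counts as changed when its relative difference exceeds
    `tolerance` (an assumed, illustrative cutoff).
    """
    changed: List[str] = []
    unchanged: List[str] = []
    for name in sorted(before.keys() & after.keys()):
        b, a = before[name], after[name]
        scale = max(abs(b), abs(a))
        relative = 0.0 if scale == 0 else abs(a - b) / scale
        (changed if relative > tolerance else unchanged).append(name)
    return changed, unchanged


# Hypothetical measurements on the ancestral proteins reconstructed
# before and after an episode of rapid sequence evolution.
before = {"kcat": 10.0, "Km": 2.0, "thermal_stability": 55.0}
after = {"kcat": 10.5, "Km": 0.4, "thermal_stability": 54.0}

changed, unchanged = partition_behaviors(before, after)
print(changed)    # ['Km']
print(unchanged)  # ['kcat', 'thermal_stability']
```

Under the reasoning of the method, the behaviors in `changed` would be candidates for adaptive significance, and those in `unchanged` would not.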



BRIEF DESCRIPTION OF THE DRAWINGS 

Drawing 1. The three elements modeling the evolutionary history of the leptins, proteins from the 
"obesity gene" identified by genetics experiments in mice. Homologs are found in other mammals 
(including human). (a) An evolutionary tree showing the pedigree of each leptin family member. (b) A
part of the multiple alignment, showing the genetic relationship of amino acids in the protein sequence.
The reconstructed ancestral sequence from the (now extinct) ancestor of humans, rodents, and ruminants
(marked "X") is shown in the alignment. The sequence as shown here is deterministic; in the work to be
performed here, the ancestral sequences are all probabilistic (see text).

080 090 100 110 120 

. 1 . I . I • I • I 
RNVIQISNDLENLRDLLHVLAFSKSCHLPWASGLETLDSLGGVLEASGYS human 

RNMIQISNDLENLRDLLHVLAFSKSCHLPWASGLETLDSLGGVLEASGYS chimp 

RNMIQISNDLENLRDLLHVLAFSKSCHLPWASGLETLDSLGGVLEASGYS gorilla 

RNVIQISNDLENLRDLLHVLAFSKSCHLPWASGLETLDRLGGVLEASGYS orangutan 

RNVIQISNDLENLRDLLHLLAFSKSCHLPLASGLETLESLGDVLEASLYS rhesus 

QNVLQIAHDLENLRDLLHLLAFSKSCSLPQTRGLQKPESLDGVLEASLYS rat

QNVLQIANDLENLRDLLHLLAFSKSCSLPQTSGLQKPESLDGVLEASLYS mouse 

RNVIQISNDLENLRDLLHLLASSKSCPLPQARGLETLESLGGVLEASLYS ancestor X 

RNVIQISNDLENLRDLLHLLASSKSCPLPQARALETLESLGGVLEASLYS pig 

RNVIQISNDLENLRDLLHLLAASKSCPLPQVRALESLESLGWLEASLYS sheep 

RNWQISNDLENLRDLLHLLAASKSCPLPQVRALESLESLGWLEASLYS ox 

RNWQISNDLENLRDLLHLLASSKSCPLPRARGLETFESLGGVLEASLYS dog 



Drawing 1. xxx Evolutionary tree showing the evolutionary history of the leptins. Heavy lines show
branches with expressed/silent ratios higher than 2. Hatched lines show branches with expressed/silent
ratios from 1 to 2. Dotted lines show branches with expressed/silent ratios less than 1, or indeterminate.
Numbers on the lines indicate the ratio of expressed/silent changes for that branch. An "x" at the end of a
branch signifies that a sequence for the protein is available in the database. According to the method of
the instant invention, a correlation between the episode of high sequence evolution and the

Drawing 2. Evolutionary tree showing the evolutionary history of the leptin receptors. Heavy lines
show branches with expressed/silent ratios higher than 2. Hatched lines show branches with
expressed/silent ratios from 1 to 2. Dotted lines show branches with expressed/silent ratios less than 1, or
indeterminate. Numbers on the lines indicate the ratio of expressed/silent changes for that branch. An "x"
at the end of a branch signifies that a sequence for the protein is available in the database.
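
The line-style convention used in these tree drawings amounts to a simple classification of each branch by its expressed/silent ratio, which might be sketched as follows. The function name and the use of raw substitution counts as inputs are assumptions of this example. Treating a branch with no silent changes as "indeterminate" is likewise an interpretation made here, not something the captions spell out.

```python
def classify_branch(expressed: int, silent: int) -> str:
    """Return the line style for a branch from its expressed and silent change counts.

    heavy   : ratio higher than 2
    hatched : ratio from 1 to 2 (inclusive)
    dotted  : ratio less than 1, or indeterminate (assumed here: no silent changes)
    """
    if silent == 0:
        return "dotted"  # ratio cannot be computed; treated as indeterminate
    ratio = expressed / silent
    if ratio > 2:
        return "heavy"
    if ratio >= 1:
        return "hatched"
    return "dotted"


# Hypothetical branches given as (expressed, silent) change counts.
for counts in [(9, 3), (4, 3), (1, 3), (2, 0)]:
    print(counts, classify_branch(*counts))
```

Branches classified as "heavy" correspond to the episodes of rapid non-silent change that the method treats as candidates for adaptive evolution.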

Drawing 3. An example of homoplasy taken from the evolution of alcohol dehydrogenase from yeast
(position 30). At three or more points in the tree, a P->A substitution occurred independently.
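
Detecting homoplasy of this kind reduces to finding the same substitution at the same position on more than one branch of the tree. The Python sketch below assumes that substitutions have already been assigned to branches by an ancestral sequence reconstruction. The branch labels and the tuple layout are hypothetical, and the extra substitution at position 12 is invented to show a non-homoplastic case.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Substitution = Tuple[str, int, str, str]  # (branch id, position, ancestral, derived)


def find_homoplasies(
    substitutions: Iterable[Substitution],
    min_events: int = 2,
) -> Dict[Tuple[int, str, str], List[str]]:
    """Map (position, ancestral, derived) to the branches on which it occurred.

    Only substitutions seen on at least `min_events` distinct branches are kept.
    """
    branches = defaultdict(set)
    for branch, position, ancestral, derived in substitutions:
        branches[(position, ancestral, derived)].add(branch)
    return {
        key: sorted(found)
        for key, found in branches.items()
        if len(found) >= min_events
    }


# Hypothetical branch assignments: P->A at position 30 on three separate branches.
subs = [
    ("b1", 30, "P", "A"),
    ("b4", 30, "P", "A"),
    ("b7", 30, "P", "A"),
    ("b2", 12, "K", "R"),
]
print(find_homoplasies(subs))  # {(30, 'P', 'A'): ['b1', 'b4', 'b7']}
```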

Drawing 4. For 17 vertebrate aromatases, an unrooted evolutionary tree built by a Darwin-based
(Gonnet & Benner, 1991) analysis of amino acid sequences. Numbers on the branches are
the Ka/Ks ratios evaluated using the methods of Fitch (1971) to reconstruct intermediate evolutionary
states and Li et al. (1985). The key is given below, together with the multiple sequence alignment used to
calculate the tree.



1. Tilapia nilotica (rainbow trout), GenBank g1613859, mRNA (Chang et al., 1997)

2. Oryzias latipes (medaka), GenBank g1786171, ovarian follicle mRNA (Tanaka et al., 1995)

3. Danio rerio (zebrafish), GenBank g2306966 aromatase mRNA

4. Carassius auratus (goldfish) ovary, GenBank g2662330, ovarian mRNA

5. Ictalurus punctatus (channel catfish), GenBank g912802 (Trant, 1994)

6. Carassius auratus (goldfish) brain, GenBank g2662328, brain mRNA

7. Sus scrofa (pig) placental, isoform 2, GenBank g1762232, mRNA (Choi et al., 1997a)

8. Sus scrofa (pig) embryo, isoform 3, GenBank g1244543, mRNA (Choi et al., 1996)

9. Sus scrofa (pig) ovary, isoform 1, GenBank g1928957, mRNA (Conley et al., 1997)

10. Bos taurus (ox), GenBank g665546, mRNA (Hinshelwood et al., 1993)

11. Equus caballus (horse), GenBank g2921277, mRNA (Boerboom et al., 1997)

12. Mus musculus (mouse), GenBank g3046857, mRNA (Terashima et al., 1991)

13. Rattus norvegicus (rat), GenBank g203804, mRNA (Hickey et al., 1990)

14. Oryctolagus cuniculus (rabbit), GenBank g1240042, mRNA (Delanie et al., 1996)

15. Homo sapiens (human), GenBank g28846, mRNA (Harada, 1988)

16. Gallus gallus (chicken), GenBank g211703 (McPhaul et al., 1988)

17. Poephila guttata (zebra finch), GenBank g926845, ovary mRNA (Shen et al., 1994)

010 020 030 040 050 060 070 080 

I I I I I I I I 

1 MVLEMLNPMHYKVTSMVSEWPFASIA^ 

2 MFLEMIOTMQYWrilOTETVWSAMPIJXI^ 

3 MILEMLNPMHYNLTSMVPEVMPVATLPII^ 

4 VLELLMQGAHNSSYGAQDNVCGAMATLJuI^^ 

5 -MEE\nJCGTVNFAATVQVTmALTGTIJjLILI^ 

6 VVDIJjIQRAHNGTERAQDNACX3ATATIIJJXIX:IJJ^ 

7 MVLEMLNPMYYKITSMVSE\A^FASIAVIJ^ 

8 LVSIAPimVGLP-SGIimTRSLIIiVCLm^SHSE^ 

9 MVLEbILNPMN--ISSMVSEAVLFGSIAII^ 

10 MVLEMLNFPMHFNITTMWAAMPAATMPILI^ 

11 VMEILIJ^EAI^TDPRYENPRG-ITIJ^IJJXLV^ 

12 MVLETLNPLHYNITSLWDITy^^ 

13 WARSLCDLKCHPirxSISMATRTLIIXVCLIiVAW 

14 MLLEVLNPRHYNVTSMVSEVVPIASIAILLLTC 

15 MVLEMI^IHYNITSIWEAMPAATMPVL^ 

16 MPVATTOIIILICFLFLIWNHEETS-SIPGPGYCMGIGPLISHG^ 

17 MFLEMU^PMHYNVTIMVPETVPVSA^

090 100 110 120 130 140 150 160

I I I I I I I I

1 TYGEFIRVWIGGEETLIISKSSSVFHVMKHSHYTSRFGSKPGLQFIGMHE^

2 HYGEFMRWISGEETLIISKSSSMFHVMKHSHYISRFGSKRGLQCIG^IH^

3 MYGEF\mWI93EETLVISKSSSTFHIMKHDHYSSRFGSTFGLQYMGMHEl^

4 KYGDIWWINGEETLILSRSSAVYHVIJIKSLYTSRFGSKLGLQCI

5 KYGSIARWISGEETFILSKSSAVYHVLKSNNYTGRFASKKGLQCIGM^^

6 KYGDIVRWINGEETLILSRSSAVyHVLRKSLYTSRFGSKLGLQCIGM^

7 MYGEFMRWIGGEETLIISKSSSVFHVMKHSHYTSRFGSKPGLECIGM^ 

8 KYGDIVRWINGEETLILSI^AVHHVLKNRKYTSRrcSKQGLSCIG^ 

9 MYGEFMRWIGGEETLIISKSSSIFHIMKHKHYTCRFGSKIXSLECIGMHEKGIM^

10 I^GEFIRWICGEETLIISKSSSMFHVMKHSHWSRFGSKPGLQCIGMHENGI

11 KYGDMVRWISGEETLVLSRPSAVYHVIiKHSQYTSRFGSKI^

12 TYGDFVRWISGEETFIISKSSSVSHVMKHWHWSRFGSKLGLQCIGM^

13 KYGDIVRWINGEETLILSRSSAVHHVLKiraOTSRFGSIQGLS^^^

14 MYGEFMRVWVCGEETLIISKSSSMFHVMKHSHYISRFGSKI/5L^^

15 WGEFMRVWISGEETLIISKSSSMFHIMKHNHYSSRFGSKLGIjQ^

16 TYGEFVRWISGEETFIISKSSSVFHVMKHWNWSRFGSKLGLQCIGM^^
17 MYGEFMRWISGEETLIISKSSSMVHVMKHSNYISRFGSKRGI^IGM^^ 

170 180 190 200 210 220 230 240 

I I I I I I I I 

1 M\m^CADSITKHLDKLEEVRISIDI/3YVDVLT^ 

2 MVEVa/ESIKQHLDRLGEVTOTSGOTIOTjTLjMRHIMLOT 

3 MVTVCVESVNNHLDRIX>EVT1SIAL(^^ 

4 TIjEICITSTNTHIJDNLSHLMDARGQVDIL^^ 

5 SVCVCVSATNKQLNVLQEFTDHSGHVD^^ 

6 TMEICTTSTNSHUDDLSQLTDAQGQIJ^ILNIjI^ 

7 M\m^(:MSITKHI^KLEEVRNDIiGYVD^ 

8 TVEVCVTSTQTHLDNLSSL SYVDVIXSFLRCTVVDISNRLFLGVFVDEKEI^ 

9 MVWCADSITKHIJ:)KIiEEVRNDIX3nfVD^ 

10 MVAICVGSIGRHIJDKI^EVTTRSGCVDVLTLM^ 

11 TLEICTMST^ITHLlDGLSRLTnAQGHVD\^^ 

12 MIAICVESTTEHIiDHLQEVTTELGNINALNIiM^ 

13 TVDVCVSSIQAHLDHLDSL GHVIJVLbJLLRCTVIJD 

14 M\n*ICADSITKHLDRLEEV^CMDIiGWDVLT]^^ 

15 MVTVCAESLKTHIJ)RLEEVTI^SGYVD\nj 

16 MIAIC:VESTIVHU)KIiEEVTTEVC3WNV^ 

17 MV^CVESIKQHLDRLGDVTDNSGOTIW^^^^RHIML^

250 260 270 280 290 300 310 320

I I I I I I I 



1 WLYRKYEKSVKDIiCEDMEILIEKKRR^ 

2 WLYRKYERSVKDIiOEIAVLVEKKRHKVSTAEKI^^ 

3 WLSRKHQKSIKEIJ^DAVGIU^EEKRHRIFTAEKI^HVDFATO^ 

4 WIHGKHKI^AQELQDAIAALIEQKRVQLTRAEKFDQ-U^FTCE^ 

5 F\A^KKYHIAAKEWDmGKLVEQKRQAIN^^ 

6 wmRKHKimAQELQIMTALIEQKKVQLAH^^ 

7 WLYKKHKESVKDIiCENMEILIEKKRCSIITAEKI^^ 

8 WIH0I^AAQELQDAIESLVERKRKEMEK3AEKLDN-I^ 

9 WLYRKYEKSVKDIiCDAMEILIEEKRHRISTAEKI^SMDF^ 

10 VE.YKKYEKSVKDIJa3MDILVEKKRRRISTAEK^ 

11 wiiTOKHimAQEIJroAIEDLIEQKRTELQQAEKI^ 

12 WIXrKKYKDAVKDIiCGAMEILIEQKRQKLSTVEK^ 

13 wiHHRHKTATQELQDAIKiy.VDQKRKNMEX3ADKI^ 

14 WIXTRKYEKSVKDLKDAMEILIAEKRHRISTAE^ 

15 WLYKKYEKSVKDIiODAIEVLIAEKRRRISTEEKI^ 

16 WLOOCfEEAAKDIJCGAMEILIEQKRQKLSTVEK^ 

17 vfl.YRKYERSVKDIja)EIEILVEKKRQKVSSAEKI£^ 

330 340 350 360 370 380 390 400 

I I I I I I I ' 

1 IJI^IAKHPQVEEELMKEIOTWGERDIRNDDMQKLEV^^ 

2 UiVAEYPEVEAAII^IHIWGDRDIKIEDIQNIJCVVENFIN^ 

3 i/n.IAQHPKVEEAmKEI(?IVLGERDIiaro 

4 UiMQNPDVEl^II^EMNAVIAGRSI^HSHLSGI^ 

5 UilJCQNSVVEEQIVQEIQSQIGERDVESADLQKIJm^ 

6 MJ^QNPDVELKII^EMDSVIAGQSI^HSHLSKI^ 

7 iJXIAKHPQVEEAIVKEIQTVIGERDIiySIDDMQKIJ^^ 

8 UiI^QNPHVELQIA^EIiyriVGDSQI^NQDLQKL^VI£ 

9 IJ1,IANHPQVEEEI.MKEIYTWGERDIRM)DMQKIJC^^ 

10 LJ^IAKHPSVEEAIMEEI^?IWGERDIRIDDIQKIJCWE^ 

11 LilJ^QNAEVERRILTEIHTVUSOTELQHSHLSQIJ^ 

12 LILIAEHPTVEEEMMREIETIWGDRDIQSDDMPNIiCIVENFIY^ 

13 UJJ^QNPHVEPQU^EIDAWGERQL^NQDLHKLQVMES^ 

14 I^IAKHPQVEEAIIREI<;?IWGERDIRIDDMQKLKV^^ 

15 I^IAKHPNVEEAIIKEIC?IVIGERDIKIDDIQKLKVMEN^ 

16 LILIADDPTVEEKMI^IETVMGDREVQSDDMPNIJCIVENFIYESM^

17 liLIAEYPEVETAIU^EIHTWGDRDIRIGDVQNIJCWENFI^

410 420 430 440 450 460 470 480

I I I I I I I I

1 GRMHRI^PKPNEFTI^AKNVPYR-YFQPFGro^ 

2 GRMHRI^PKPNEFTLENFEKNVPYR-YFQPFGrc 

3 GRMHKI£FFPKP^^EFTLENFEKNVPYR-Y^ 

4 GRMHRSEFFPKPNEFSLJDNFQKNVPSR-FFQPFGSGPRSCVGKHIAMVM^ 

5 GRMHKSEFFQKPNEFNLENFE^m/PSR-YFQPFG(:x;P 

6 GRMHRSEFFSKPNQFSI£)NFHK1WPSR-FFQPFGSGPRSCTGKHI^^ 

7 GRMHRI^PKPNEPI1^AKNVPYR-YFQPPGFX;PRACAGKYI^^ 

8 GRMHRTEFFHKANEFSI^QKOTPRR-YFQPFGSGPRACVGRHIAMV^^ 

9 GRMHRI^PKPNEFTI^AKNVPYR-YFQPPGFGPRAC^^ 

10 GRMHRI£FFPKPNEF^LE^3FAKNVPYR-YFQPFGFGP 

11 GRMHRSEFYPKPADFSII)NFNKPVPSR-FFQPFGSGPRSC:VGKHIAMV^^ 

12 GRMHKI£FFPKPNEFSLENFEKNVPSR-YFQPFGFGPRSOTGKFI^^ 

13 GRMHRTEFFIJCGNQFNLEHFENNVPRPPTFQPFG^ 

14 GRMHRLEFFPKPNEFTLENFAKNVPYR-YFQPFGFGPRGC^ 

15 GRMHRI^PKPNEFTLENFAKNVPYR-YI^PFGFGPRGC^ 

16 GRMHKIJE2T^PKPNEFSLENFEKNVPSR-Y^ 

17 GRMHRLEYFPKPNEFTI^IQFEKNVPYR-YF^ 

490 
I 

1 SLHPDETSG

2 SLHPNEDRH

3 AliHPDESRS 

4 SQQPVEEPS 

5 SMQPVEEDP 

6 SQQPVEEPS 

7 SLHPDETSG 

8 SQQPVEHHQ 

9 SLHPHETSG 

10 SIiHPDETSD 

11 SQQPVEDKH 

12 SMHPIERQP 

13 SQQPVEHQQ 

14 SLHPDETRD 

15 SLHPDETKN 

16 SMHPIERQP 

17 SLHLDEDSP 



Drawing 5. An evolutionary tree built from neutral evolutionary distances (NEDs) calculated by 
assuming a first order approach to equilibrium for codon usage at two fold redundant silent sites. 
Numbers on branches of the tree correspond to evolutionary time (in million years) estimated from the 
NEDs using a first order rate constant for pyrimidine-pyrimidine transitions of 3 x 10-^ changes per base 
per year. 

Drawing 6. The Notch family, with f2 values for each of the internal nodes, and Ka/Ks values for each of 
the branches. 

DETAILED DESCRIPTION OF THE INVENTION

This disclosure describes the classes of tools that permit the scientist to generate experimentally 
testable hypotheses concerning the function of a protein starting from an evolutionary analysis. These are 
outlined below: 

I. Tools that detect change in function within a family of proteins. 

A. Ratios of silent to non-silent substitution along specific branches of an evolutionary tree 
including tools that address normalization issues. 

B. Covarion behavior, in which individual residues display different mutability in different 
branches of a tree. 

C. Detecting high absolute rates of amino acid substitution, changes per unit time. 

II. Tools that detect conservation of function within a family of proteins. 

A. Compensatory changes 

B. Homoplasy 

C. Absolute conservation within a defined evolutionary distance 

III. Tools that identify individual residues involved in changes in functionally significant behavior. 

A. Residues changing in episodes with high Ka/Ks values, minus residues changing in episodes 
with low Ka/Ks values 

B. Residues displaying covarion behavior 

C. Mapping these residues on to models for the secondary, tertiary, and quaternary structure of 
proteins. 

IV. Tools that identify individual residues involved in conservation of functionally significant behavior 

A. Residues suffering compensatory changes 

B. Residues displaying homoplasy 

C. Mapping these residues on to models for the secondary, tertiary, and quaternary structure of 
proteins. 

V. Tools that involve correlation between the evolutionary histories of two families of proteins 

A. Correlating the topology of evolutionary trees in two families of proteins 

B. Correlating the connectivity of proteins in a gene family 

C. Dating events in the molecular history 

D. Correlating evolutionary events in two protein families occurring at approximately the same 
time 

E. Correlating evolutionary events in two protein families that are associated with analogous 
behavior involving expressed/silent ratios 

VI. Tools that involve correlation between the evolutionary history of a family of proteins and the 
evolutionary history of the organism as known from some source other than genomic sequence data, 
including paleontology, geology, ecology, ontogeny, phylogeny, or systematics (collectively known as 
the "non-genomic record"). 

A. Correlating the topology of an evolutionary tree and the non-genomic record. 

B. Correlating features of patterns of evolution in specific branches in the evolutionary tree with 
the non-genomic record 

C. Correlating evolutionary events in several protein families occurring at approximately the same 
time with the non-genomic record 

Many of these tools are new in this disclosure. Others were disclosed in Serial No. 07/857,224 and 
Serial No. 08/914,375 and are claimed here for the first time. In many cases, elements of novelty and 
utility can be found by combining these tools. This disclosure will systematically indicate the Applicant's 
presently preferred combinations, with statements of where the Applicant believes that the state of the prior 
art requires reference to the priority dates of parent applications, and where it does not. 

All of the tools have in common the same starting point, a basic evolutionary model based on three 
parts: 

(a) An evolutionary tree that shows the familial relationship between the members of the protein family, 

(b) A multiple alignment of the sequences of members of the protein family, which shows the 
evolutionary relationship between the individual amino acids in the sequences, and 

(c) The sequences of ancient proteins that were the ancestors of the contemporary proteins in the family. 

Each element of an evolutionary model requires the other two in the reconstruction process. 
Accordingly, processes for constructing an evolutionary model for a protein family are frequently iterative. 
These processes are well known in the art, and include parsimony tools [Fit67], maximum likelihood tools 
[Gon91][Gon96][Tho92], tools for evaluating the probability of an evolutionary model [Gon96], and 
gamma models [Swo96][Li97]. 

Serial No. 08/914,375 disclosed the step-by-step procedure in which the basic evolutionary model 
for a family of proteins is constructed to support the tools outlined above. 

(a) A multiple alignment, an evolutionary tree, and ancestral sequences at nodes in the tree are 
constructed by methods well known in the art for a set of homologous proteins. These three elements of 
the description are interlocking, as is well known in the art. The presently preferred method of 
constructing ancestral sequences for a given tree is the maximum parsimony method, as implemented 
(for example) in the commercially available program MacClade [W. P. Maddison, D. R. Maddison, 
MacClade, Analysis of Phylogeny and Character Evolution, Sinauer Associates, Sunderland MA (1992)]. 
Alternative methods for reconstructing evolutionary intermediates can now be found with the PAUP 
program [Swofford, D. L., Olsen, G. J., Waddell, P. J., & Hillis, D. M. (1996) Phylogenetic Inference in 
Molecular Systematics (eds. Hillis, D. M., Moritz, C. & Mable, B. K.) 407-514 (Sinauer Assoc., Inc., 
Sunderland, MA, 1996)] and using the maximum likelihood method of the PAML program [Yang, Z. H. 
PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13, 
555-556 (1997)]. Trees are compared based on their scores using either maximum parsimony or 
maximum likelihood criteria, and selected based on considerations of score and correspondence to known 
facts. Step (a) is part of the process used to generate the predictions of secondary structure using the 
method disclosed in Serial No. 07/857,224. 

(b) A corresponding multiple alignment is constructed by methods well known in the art for the 
DNA sequences that encode the proteins in the protein family. The multiple alignment is constructed in 
parallel with the protein alignment. In regions of gaps or ambiguities, the amino acid sequence alignment 
can be adjusted to give the alignment with the most parsimonious DNA tree. The presently preferred 
method of constructing ancestral DNA sequences for a given tree is the maximum parsimony method. 
The DNA and protein trees and multiple alignments must be congruent, meaning that when amino acids 
are aligned in the protein alignment, the corresponding codons are aligned in the DNA alignment. 
Likewise, the connectivity of the two evolutionary trees must show the same evolutionary relationships. In 
regions where the connectivity of the amino acid tree is not uniquely defined by the amino acid sequences, 
the tree that gives the most parsimonious DNA tree is used to decide between two trees or reconstructions 
of equal value. Finally, the ancestral amino acids reconstructed at nodes in the tree must correspond to the 
reconstructed codons at those nodes. When the ancestral sequences are ambiguous, and where the DNA 
sequences cannot resolve the ambiguity, the reconstructed DNA sequences must be ambiguous in parallel. 
Approximate reconstructions are valuable even when exact reconstructions are not possible from available 
data, and the tree is preferably constrained to correspond to evolutionary relationships between proteins 
inferred from biological data (e.g., cladistics). 

(c) Mutations in the DNA sequences are then assigned to each branch of the DNA evolutionary tree. 
These may be fractional mutations to reflect ambiguities in the sequences at the nodes of the tree. When 
ambiguities are encountered, alternatives are weighted equally. Mutations along each branch are then 
classified as either "silent", meaning that they do not have an impact on the encoded protein sequence, or 
"expressed", meaning that they do have an impact on the encoded protein sequence. Fractional 
assignments are made in the case of ambiguities in the reconstructed sequences at nodes in a tree. 
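Step (c) can be illustrated with a minimal Python sketch. It is not the procedure of Serial No. 08/914,375 itself. It assumes the standard genetic code, codon-aligned and gap-free sequences at the two ends of a branch, and a representation (hypothetical, chosen here for illustration) in which each codon site is a list of equally weighted alternative codons. Changes are classified per codon rather than per nucleotide, which is a simplification when a codon differs at more than one position. The function name classify_branch is likewise an assumption.

```python
from itertools import product

# Standard genetic code, built in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_branch(ancestor, descendant):
    """Return (silent, expressed) codon changes along one branch.

    ancestor, descendant: lists of codon sites of equal length. Each site is
    a list of alternative codons; ambiguous reconstructions list several
    alternatives, which are weighted equally (fractional assignment).
    """
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must be codon-aligned and of equal length")
    silent = expressed = 0.0
    for anc_alts, desc_alts in zip(ancestor, descendant):
        weight = 1.0 / (len(anc_alts) * len(desc_alts))
        for a in anc_alts:
            for d in desc_alts:
                if a == d:
                    continue  # no change for this pair of alternatives
                if CODON_TABLE[a] == CODON_TABLE[d]:
                    silent += weight      # same amino acid encoded
                else:
                    expressed += weight   # encoded amino acid differs
    return silent, expressed


# Toy branch: Leu->Leu (silent), Glu->Asp (expressed), and an ambiguous
# ancestral Lys codon (AAA or AAG) leading to AAG (half a silent change).
ancestor = [["CTT"], ["GAA"], ["AAA", "AAG"]]
descendant = [["CTC"], ["GAT"], ["AAG"]]
print(classify_branch(ancestor, descendant))
```

The fractional half-count for the third site shows how an ambiguity at a node propagates into the tallies for the branch.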

As disclosed in Serial No 08/914,375, the quality of a multiple alignment and the precision of the 
reconstructed ancestral sequences decrease if proteins are included in the family with sequences 
diverging by over 150 PAM units, where a PAM unit is the number of point accepted mutations per 100 
amino acids. For this reason, families are most preferably constructed with a tree "width" (the distance 
between the two most divergent proteins in the family) of 150 PAM units or less. Some variation is, of 
course, desired. Therefore, the PAM width of the tree is preferably more than 50 PAM units. Also 
preferred are well articulated trees. In principle, the more sequences in the tree, the more valuable an 
evolutionary analysis of the tree becomes. 

With the emergence of massive amounts of sequence information as a result of genome projects, the 
ability to construct detailed evolutionary histories of protein families will increase. This will make the 
inventions disclosed herein of still greater value, as is appreciated by one of ordinary skill in the art. 

One key inventive feature of Serial No 07/857,224 was that an evolutionary analysis had additional 
value when placed within well defined. One key inventive feature of Serial No 08/914,375 was that an 
evolutionary analysis gained additional value when it involved analysis of explicitly reconstructed 
intermediates in the evolutionary tree. These inventive concepts are at the core of all of the tools outlined 
above. 

Another key inventive feature of Serial No 08/914,375 was that an evolutionary analysis gained 
additional value when it was correlated with the non-genomic record. This inventive concept is at the core 
of all of the tools in class VI outlined above. 

Another key inventive feature of Serial No 08/914,375 involved the use of a natural organization to 
generate a rapidly searchable database. As disclosed in the specification to Serial No 08/914,375, when all 
of the genomes of all of the organisms on planet Earth are completed, all protein sequences will be easily 
recognizable as members of one of ca. 10,000-100,000 nuclear families, protein sequence modules 
50-500 amino acids long that are related by common ancestry. This conclusion reflects the well known fact 
that all organisms on the planet are descendants of a single ancestor. In the course of producing the 
diversity of organisms now on Earth, divergent evolution also produced the diversity of molecular genetic 
sequences within nuclear families. 

As disclosed in the specification to Serial No 08/914,375, this permits a naturally organized database. 
The ancestral sequences and the predicted secondary structures associated with the families are surrogates 
for the sequences and structures of the individual proteins that are members of the family. The 
reconstructed ancestral sequence represents in a single sequence all of the sequences of the descendent 
proteins. The predicted secondary structure associated with the ancestral sequence represents in a single 
structural model all of the core secondary structural elements of the descendent proteins. Thus, the 
ancestral sequences can replace the descendent sequences, and the corresponding core secondary 
structural models can replace the secondary structures of the descendent proteins. 

This makes it possible to define two surrogate databases, one for the sequences, the other for 
secondary structures. The first surrogate database is the database that collects from each of the families of 
proteins in the databases a single ancestral sequence, at the point in the tree that most accurately 
approximates the root of the tree. If the root cannot be determined, the ancestral sequence chosen for the 
surrogate sequence database is near the center of mass of the tree. The second surrogate database is a 
database of the corresponding secondary structural elements. The surrogate databases are much smaller 
than the complete databases that contain the actual sequences or actual structures for each protein in the 
family, as each ancestral sequence represents many descendent proteins. Further, because there is a limited 
number of protein families on the planet, there is a limit to the size of the surrogate databases. Based on 
our work with partial sequence databases [Gonnet et al., op. cit. 1992], we expect there to be fewer than 
10,000 families as defined by steps (a) through (e). 

Searching the surrogate databases of the instant invention for homologs of a probe sequence thus 
proceeds in two steps. In the first, the probe sequence (or structure) is matched against the database of 
surrogate sequences (or structures). As there will be on the order of 10,000 families of proteins as defined 
by steps (a) through (e) after all the genomes are sequenced for all of the organisms on earth, there will be 
only on the order of 10,000 surrogate sequences to search. Thus, this search will be far more rapid than 
with the complete databases. A probe protein sequence (or DNA sequence in translated form) can be 
exhaustively matched [Gonnet et al., op. cit. 1992] against this surrogate database (that is, every 
subsequence of the probe sequence will be matched against every subsequence in the ancestral proteins) 
more rapidly than it could be matched against the complete database. 

Should the search yield a significant match, the probe sequence is identified as a member of one of the 
families already defined. The probe sequence is then matched with the members of this family to 
determine where it fits within the evolutionary tree defined by the family. The multiple alignment, 
evolutionary tree, predicted secondary structure and reconstructed ancestral sequences may be different 
once the new probe sequence is incorporated into the family. If so, the different multiple alignment, 
evolutionary tree, and predicted secondary structure are recorded, and the modified reconstructed ancestral 
sequence and structure are incorporated into their respective surrogate databases for future use. 
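The two-step search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exhaustive matching of [Gonnet et al., op. cit. 1992]. The similarity score (fraction of shared three-residue words), the threshold of 0.5, the dictionary layout of the surrogate database, and the toy sequences and family names are all hypothetical stand-ins introduced here.

```python
def kmers(seq, k=3):
    """Set of all overlapping k-residue words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Crude similarity: shared k-mers over the smaller k-mer set.
    A placeholder for a real alignment score."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def surrogate_search(probe, families, threshold=0.5):
    """Two-step search.

    families: dict mapping family name -> {"ancestor": str,
                                           "members": {name: sequence}}
    Step 1 scores the probe against one ancestral (surrogate) sequence per
    family. Step 2 scores it against the members of the best family only.
    Returns (family, closest_member), or None if no surrogate matches.
    """
    # Step 1: search the small surrogate database of ancestral sequences.
    best_family = max(
        families, key=lambda f: similarity(probe, families[f]["ancestor"])
    )
    if similarity(probe, families[best_family]["ancestor"]) < threshold:
        return None
    # Step 2: place the probe within the matched family.
    members = families[best_family]["members"]
    closest = max(members, key=lambda m: similarity(probe, members[m]))
    return best_family, closest


# Toy surrogate database with two hypothetical families.
families = {
    "famA": {
        "ancestor": "GRMHRLEFFPKPNEF",
        "members": {"a1": "GRMHRSEFFPKPNEF", "a2": "GRMHKLEFFPKPNEF"},
    },
    "famB": {
        "ancestor": "MKTAYIAKQRQISFV",
        "members": {"b1": "MKTAYIAKQRQLSFV", "b2": "MKSAYIAKQRQISFV"},
    },
}
print(surrogate_search("GRMHKLEFFPKPNEY", families))
```

The cost of step 1 scales with the number of families rather than the number of sequences, which is the property the surrogate database is meant to provide.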

The advantage of this data structure over those presently used is apparent. As presently organized, 
sequence and structure databases treat each entry as a distinct sequence. Each new sequence that is 
determined increases the size of the database that must be searched. The database will grow roughly 
linearly with the number of organismal genomes whose sequences are completed, and become 
increasingly more expensive to search. 

The surrogate database will not grow linearly. Most of the sequence families are already represented 
in the existing database. Addition of more sequences will therefore, in most cases, simply refine the 
ancestral sequences and associated structures. In any case, the total number of sequences and structures in 
their respective databases will not grow past ca. 10,000, the estimate for the total number of sequence 
families that will be identifiable after the genomes of all organisms on earth are sequenced. If a 
dramatically new class of organism is identified, this estimate may grow, but not exponentially (as is the 
growth of the present database). 

Since Serial No. 08/914,375 was filed, other databases have emerged that offer some precomputed 
families. Most noteworthy are Pfam [Bat00] and ProDom [Cor00]. 

Serial No 07/857,224 disclosed methods to identify residues, secondary structural elements, and 
evolutionary episodes that are involved in functional adaptation. 

Further, during episodes of rapid sequence evolution, amino acid substitutions will be concentrated in 
secondary structural elements defined by the method claimed in Serial No. 07/857,224. These are 
secondary structural elements that are important in the acquisition of new function. A general method for 
identifying secondary structural elements that contribute to the origin of new biological function is 
comprised of identifying an element in the predicted secondary structure model where the corresponding 
section of the gene has a high ratio of expressed to silent changes. 

4. Identification of in vitro behaviors that contribute to physiological function. 

In vitro experiments in biological chemistry extract data on proteins and nucleic acids (for example) 
that are removed from their native environment, often in pure or purified states. While isolation and 
purification of molecules and molecular aggregates from biological systems is an essential part of 
contemporary biological research, the fact that the data are obtained in a non-native environment raises 
questions concerning their physiological relevance. Properties of biological systems determined in vitro 
need not correspond to those in vivo, and properties determined in vitro need have no biological relevance 
in vivo. 

To date, there has been no simple way to say whether or not biological behaviors are important 
physiologically to a host organism. Even in those cases where a relatively strong case can be made for 
physiological relevance (for example, for enzymes that catalyze steps in primary metabolism), it has 
proven to be difficult to decide whether individual properties of that enzyme (kcat, Km, kinetic order, 
stereospecificity, etc.) have physiological relevance. Especially difficult, however, is to ascertain which 
behaviors measured in vitro play roles in "higher" function in metazoa, including development, regulation, 
reproduction, and digestion. 

A general method to determine whether a behavior measured in vitro is important to the evolution of 
new physiological function is comprised of the following steps: 

(a) Prepare in the laboratory proteins that have the reconstructed sequences corresponding to the 
ancestral proteins before, during, and after the evolution of new biological function, as revealed by an 
episode of high expressed to silent ratio of substitution in a protein. This high ratio compels the 
conclusion that the protein itself serves a physiological role. 

(b) Measure in the laboratory the behavior in question in ancestral proteins before, during, and after 
the evolution of new biological function, as revealed by an episode of high expressed to silent ratio of 
substitution. Those behaviors that increase during this episode are deduced to be important for 
physiological function. Those that do not are not. 

We now discuss using the basic evolutionary model in the context of tools that generate hypotheses 
concerning function within and between protein families. 

I. Tools that detect change in function within a family of proteins. 

A. Ratios of silent to non-silent substitution along specific branches of an evolutionary tree 
including tools that address normalization issues. 

As discussed in Serial No. 07/857,224, during the divergent evolution of two proteins from a common 
ancestor, mutations of two types accumulate. The first have no impact on the ability of the host organism 
to survive, select a mate, and reproduce; these are called "neutral" mutations. The second influence the 
behavior of the protein in a way that influences the ability of the organism to survive, select a mate, and 
reproduce. These are termed "adaptive mutations." When evolving a new function, proteins undergo an 
episode of rapid sequence evolution that corresponds to adaptive "positive selection", as is well known in 
the art [Kreitman, M., Akashi, H. Ann. Rev. Ecol. Syst. 26, 403-422 (1995)]. 

Given a basic evolutionary model for a protein family, we can begin to search for sequence details that 
are indicative of function. For example, the genetic code is degenerate. Some mutations randomly 
introduced into a genome do not alter the encoded amino acid ("silent mutations"). Others do ("non-silent 
mutations"). When the gene is under no selective pressure at all, it makes no difference to natural selection 
whether the mutation changes an amino acid or not. Thus, mutations at the level of the gene are 
(essentially) neutral, and are fixed in a population without regard to whether they are silent or non-silent. 
The numbers of non-silent and silent changes can be normalized for the numbers of non-silent and silent 
sites, respectively, in a particular sequence to give Ka and Ks values. 
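The normalization that yields Ka and Ks can be sketched in Python. This is a minimal illustration, not the Applicant's procedure. It assumes the standard genetic code and counts silent sites in the manner of the Nei-Gojobori approach (for each codon position, the fraction of the three possible base changes that leave the amino acid unchanged), treats changes to stop codons as non-silent, and applies no correction for multiple substitutions at the same site. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def silent_sites(codon):
    """Number of silent sites in one codon (between 0 and 3)."""
    amino = CODON_TABLE[codon]
    n_silent_changes = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == amino:
                n_silent_changes += 1
    # Each position has three possible changes, so divide by three.
    return n_silent_changes / 3


def ka_ks(coding_sequence, silent_changes, expressed_changes):
    """Return (Ka, Ks, Ka/Ks) for a branch.

    coding_sequence: reference sequence (e.g. the ancestral node), a
    multiple of three nucleotides long with no stop codons.
    silent_changes, expressed_changes: tallies for the branch; these may be
    fractional. Ka/Ks is None when Ks is zero.
    """
    codons = [coding_sequence[i:i + 3] for i in range(0, len(coding_sequence), 3)]
    s_sites = sum(silent_sites(c) for c in codons)
    n_sites = 3 * len(codons) - s_sites
    ka = expressed_changes / n_sites
    ks = silent_changes / s_sites if s_sites else 0.0
    ratio = ka / ks if ks else None
    return ka, ks, ratio


# Toy example: Leu (CTT) has one silent site, Met (ATG) has none.
print(ka_ks("CTTATG", silent_changes=1, expressed_changes=1))
```

Because silent sites are far fewer than non-silent sites, equal raw counts of silent and non-silent changes correspond to a Ka/Ks value well below unity, which is why the normalization matters.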

When the function of a protein is constant, non-silent changes are usually detrimental. Non-silent 
changes are therefore removed by natural selection. Silent changes are not. The Ka/Ks value is therefore 
lower than unity in a protein divergently evolving under a constant set of functional constraints. Indeed, 
for many proteins with function that has been established early in natural history (such as cytochromes), 
the ratio approaches zero. At the start of the evolutionary period where the calculation is done, the protein 
is already doing its job nearly optimally, and neither needs nor wants to change its amino acids. 
Conversely, if one reconstructs the evolutionary history of a protein, and identifies an episode in that 
evolution where the non-silent/silent ratio is very much less than one, the genomic analysis suggests that 
the protein has a conserved function during that episode. 

One of ordinary skill in the art will note that this method assumes that codon usage is not strongly 
selected. This assumption does not hold in eubacteria, or in highly expressed genes in yeast, for example. 
However, there is little evidence to suggest that codon usage is strongly selected in 
multicellular plants and animals, including mammals, where most of the ORFs needing analysis 
for a developmental biology program are studied. Therefore, the presently preferred scope for methods 
involving the analysis of silent substitutions is in multicellular organisms. 

The exact opposite is the case when new function (implying, of course, new behaviors as well) is being 
engineered into a protein during an episode of evolution. Non-silent changes, those where amino acids are 
replaced at the level of the protein, are the only way to change the behavior of a protein to perform its new 
role. Natural selection favors non-silent changes, as these create new behaviors. The Ka/Ks value is high. 
The ratio of non-silent to silent changes, normalized for the number of non-silent and silent sites (the 
Ka/Ks value), was introduced in the 1980s as a way of detecting change in function between proteins at the 
leaves of trees [Li97]. It was applied to a large number of cases (for an example, see [McD91][Jol89]). 
Both the Applicant [Tra96] and Stewart and her coworkers [Mes97] extended this method to analyze 
reconstructed evolutionary events, calculating Ka/Ks values between ancestral nodes in an evolutionary 
tree, and applied it to individual cases (ribonuclease and lysozyme, respectively). Using this approach, if 
one reconstructs the evolutionary history of a protein, and identifies an episode in that evolution where the 
Ka/Ks value is greater than unity, the protein is evolving a new function during that episode. 

In practice, Ka/Ks values are not so easily interpretable. Even when the function of a protein is 
changing, some residues (such as those holding together the fold) cannot change without destroying the 
ability of the protein to serve as a scaffold for function. Thus, the Ka/Ks value for specific sites can be 
very high during an episode of divergent evolution, perhaps even much higher than unity. But because 
Ka/Ks values are calculated for the sequence as a whole, the sites undergoing rapid substitution are 
counted with "core" sites undergoing slow substitution, giving a Ka/Ks value for the protein as a whole of 
less than unity. 

Likewise, Ka/Ks values are assigned to individual branches of an evolutionary tree. If the evolutionary 
tree is poorly articulated, a single branch may contain both adaptive and conservative episodes of 
evolution. In this case, the high Ka/Ks value for the adaptive episode may be diluted by a low Ka/Ks value 
for the conservative episode. The second problem will, of course, subside as more and more genome 
sequence projects are completed. 

One solution to this problem involves normalization of the Ka/Ks values for a protein family. Here, the 
average Ka/Ks value over the branches of the tree is calculated. Those branches that have a Ka/Ks value 
an arbitrary factor higher (the presently preferred factor is two fold higher) are then hypothesized to be 
undergoing a change in function. More preferably, a statistical analysis is performed where the number of 
sites undergoing changes is determined for each branch length, the average Ka/Ks value is calculated, a 
statistical model is constructed to assess the distribution of Ka/Ks values on different branches of the tree, 
and branches that have Ka/Ks values lying more than two standard deviations above the mean are 
hypothesized to contain a change in function. 
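The two normalization criteria just described can be sketched in Python. This is a minimal illustration under stated assumptions. The branch names and Ka/Ks values are invented for the example, the function name flag_branches is hypothetical, and the "statistical model" is reduced here to the sample mean and sample standard deviation of the branch values, which is only one of many models that could be used.

```python
from statistics import mean, stdev


def flag_branches(ratios, fold=2.0, n_sd=2.0):
    """Flag branches whose Ka/Ks value stands out within a family.

    ratios: dict mapping branch name -> Ka/Ks value (at least two branches).
    Returns two lists: branches at least `fold` times the family average,
    and branches more than `n_sd` standard deviations above the mean.
    """
    values = list(ratios.values())
    average = mean(values)
    spread = stdev(values)
    by_fold = [b for b, r in ratios.items() if r >= fold * average]
    by_sd = [b for b, r in ratios.items() if r > average + n_sd * spread]
    return by_fold, by_sd


# Invented values: seven conservative branches and one candidate episode.
ratios = {
    "b1": 0.10, "b2": 0.12, "b3": 0.08, "b4": 0.11,
    "b5": 0.09, "b6": 0.10, "b7": 0.13, "b8": 1.20,
}
print(flag_branches(ratios))
```

Note that an outlying branch inflates both the mean and the standard deviation, so with few branches the two-standard-deviation criterion can be hard to satisfy; a robust estimator would be a natural refinement.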

Serial No. 08/914,375 discussed in greater detail the tools based on the fact that the genetic code is 
degenerate. More than one triplet codon encodes the same amino acid. Therefore, a mutation in a gene can 
be either silent (not changing the encoded amino acid) or expressed (changing the encoded amino acid). 
Especially in multicellular organisms, and most particularly in multicellular animals (metazoa), silent 
changes are not under selective pressure. In contrast, expressed changes at the DNA level, by changing the 
structure of the protein that the gene encodes, change the property of the protein. 

When examining a protein from higher organisms during a period of evolutionary history where, at 
the outset of the period, the behavior of a protein is optimized for a specific biological function, and where 
that function remains constant for the protein throughout the period being examined, changes in the DNA 
sequence that lead to a change in the sequence of the encoded protein (expressed changes) will diminish 
the survival value of the protein [Benner, S. A., Ellington, A. D. Interpreting the behavior of enzymes. 
Purpose or pedigree? CRC Crit. Rev. Biochem. 23, 369-426 (1988)] and therefore will be removed by 
natural selection. During the same period, silent changes will not be removed by natural selection, but will 
accumulate at an approximately clock-like rate, as silent changes are approximately neutral, especially in 
higher organisms. Thus, the ratio of expressed to silent changes will be low during a period of evolution 
of a protein family where the ancestor and its descendants share a common function. 

In contrast, in genes for proteins that are neutrally drifting without functional constraints, the 
expressed/silent ratio will reflect random introduction of point mutations. Given the genetic code and a 
typical distribution of amino acid codons within the gene, the ratio of expressed to silent changes will be 
approximately 2.5 during the period of evolution of a protein family where the ancestor and its 
descendants have no function. 
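The order of magnitude of this neutral expectation can be checked by enumerating every single-base change in the standard genetic code. The sketch below is an illustration under simplifying assumptions that differ from the text's: all 61 sense codons are weighted equally rather than by a typical codon distribution, all base changes are equally likely, and changes that create a stop codon are tallied separately. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, built in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def random_mutation_counts():
    """Tally all single-base changes from sense codons.

    Returns (silent, expressed, nonsense): changes that keep the amino
    acid, change it, or create a stop codon, respectively.
    """
    silent = expressed = nonsense = 0
    for codon, amino in CODON_TABLE.items():
        if amino == "*":
            continue  # start only from sense codons
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_amino = CODON_TABLE[mutant]
                if new_amino == "*":
                    nonsense += 1
                elif new_amino == amino:
                    silent += 1
                else:
                    expressed += 1
    return silent, expressed, nonsense


silent, expressed, nonsense = random_mutation_counts()
print(silent, expressed, nonsense, round(expressed / silent, 2))
```

Under these uniform assumptions the tally is 134 silent and 392 expressed changes, a ratio near 2.9. Codon usage and mutational bias (for example, an excess of transitions) shift the value, which is consistent with the somewhat lower figure quoted above for a typical codon distribution.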

A third situation concerns a period of evolution where a protein is acquiring a new derived function. 
The amino acid sequence of the protein at the beginning of this episode will be optimized for the ancestral 
function, rather than the derived function. Thus, changes in the gene that are expressed in changes in the 
sequence of the encoded protein that improve the behavior of the protein as is required for the new 
biological function will be selected for. In proteins in such an evolutionary episode seeking new function, 
natural selection seeks expressed changes, and the ratio of expressed to silent substitutions at the DNA 
level will be high during the period of evolution of a protein family where the function of the ancestor has 
changed with a new function emerging in its descendants. Ratios as high as 4:1 or more are known. 

In a family of proteins defined by steps (a) through (e) above, individual periods of evolution are 
defined by lines between nodes on an evolutionary tree. In step (c), silent and expressed point mutations 
are assigned to individual periods of evolution. Periods of evolution with high ratios of expressed to silent 
mutations are episodes where physiological function is rapidly changing. Periods of evolution with low 
ratios of expressed to silent mutations are episodes where physiological function is slowly changing. 

Serial No. 08/914,375 showed the application of this approach to the leptin family of proteins. 
Leptins are present in mice, where they are believed to modulate feeding behavior. Leptin homologs are 
also present in humans, and the pharmaceutical industry has been excited about exploiting them in the 
treatment of obesity. The conclusion drawn from this hypothesis is that the leptin protein in humans does 
not have the same function as the leptin protein in mice. 

B. Covarion behavior, in which individual residues display different mutability in different 
branches of a tree. 

Functional changes leave signatures in the patterns of sequence evolution in a protein family. 
Covarion behavior was detected in alcohol dehydrogenase [Ben89] and superoxide dismutase [Miy95]. 
Elongation factors (EF), examined in a preliminary study in the past year, serve as an example. These are 
proteins that have diverged far more slowly; indeed, they are archetypal examples of a protein that performs the 
"same" function in all three kingdoms of life. In the example, thirty EF-Tu/EF-1α protein sequences were 
aligned over 380 sites using the alignment program Darwin. Replacement rates per site for bacterial and 
eukaryotic EFs were estimated using a gamma-based, maximum likelihood (ML) model for protein 
sequences (JTT + Γ) and the phylogeny of Baldauf et al. [Bal96] for EF-Tu and EF-1α. An α of 0.78 
was calculated for the entire tree, with a standard deviation (SD) of 0.05 using parametric bootstrapping 
(evolutionary simulations) [Swo96]. Interestingly, the α values for the bacterial and eukaryotic subtrees 
were significantly different from that for the entire tree [0.46 (0.04) and 0.38 (0.04), respectively]. These 
reductions in α for bacteria and eukaryotes alone are expected of a non-stationary covarion process. 

The distribution of rate differences per site between bacterial and eukaryotic EFs is leptokurtic; i.e., 
over-represented at the mean and in the tails, and under-represented in the "shoulders," relative to the 
expectations of a normal distribution. Thirty seven percent of the sites have essentially the same rate in 
the two groups (rate difference of ~0), as expected under a stationary gamma process. However, 18 
sites evolve >2 SD faster in bacteria than in eukaryotes, and 21 sites evolve >2 SD faster in eukaryotes 
than in bacteria. These 10% of the 
sites are most responsible for the covarion characteristics of EF-Tu and EF-1α. 
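The per-site comparison described above can be sketched as follows. This is a minimal illustration, not the maximum likelihood analysis used in the study. It assumes that per-site replacement rates have already been estimated for each subtree, uses the population standard deviation of the rate differences as the yardstick, and reports excess kurtosis (positive for a leptokurtic distribution) as a summary. The function name and the toy rates are hypothetical.

```python
from statistics import mean, pstdev


def rate_shift_sites(rates_a, rates_b, n_sd=2.0):
    """Compare per-site rates between two subtrees.

    rates_a, rates_b: equal-length lists of per-site replacement rates.
    Returns (faster_in_a, faster_in_b, excess_kurtosis), where the first
    two are lists of site indices whose rate difference lies more than
    n_sd standard deviations from the mean difference.
    """
    if len(rates_a) != len(rates_b):
        raise ValueError("rate lists must cover the same aligned sites")
    diffs = [a - b for a, b in zip(rates_a, rates_b)]
    mu = mean(diffs)
    sd = pstdev(diffs)
    if sd == 0:
        return [], [], 0.0  # identical rate profiles: nothing to flag
    faster_in_a = [i for i, d in enumerate(diffs) if d - mu > n_sd * sd]
    faster_in_b = [i for i, d in enumerate(diffs) if mu - d > n_sd * sd]
    fourth_moment = mean([(d - mu) ** 4 for d in diffs])
    excess_kurtosis = fourth_moment / sd ** 4 - 3.0
    return faster_in_a, faster_in_b, excess_kurtosis


# Toy data: 20 sites with equal rates except site 3 (fast in group A)
# and site 7 (fast in group B).
rates_a = [1.0] * 20
rates_b = [1.0] * 20
rates_a[3] = 5.0
rates_b[7] = 5.0
print(rate_shift_sites(rates_a, rates_b))
```

The flagged indices are the candidate covarion sites that would then be mapped onto a structural model.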

Residues displaying abnormal evolutionary behavior were then mapped to a three dimensional model 
of the protein based on a crystal structure of EF-Tu. These were used to generate structural hypotheses 
for the behavioral differences that were known. For example, bacterial EF-Tu binds GDP ~100 
fold tighter than GTP. Eukaryotic EF-1α, in contrast, binds both with similar affinities. EF-Tu 
regenerates its active form by binding to the single-subunit nucleotide exchange factor EF-Ts. EF-1α 
requires the multi-subunit nucleotide exchange factor EF-1βγδ. EF-1α also interacts with the cytoskeleton 
and may thereby play a role in cellular transformation and apoptosis. EF-Tu can have no such role in 
bacteria. Residues were identified that, at the level of hypothesis, are responsible for each of these 
behavioral differences. 

Covarion behavior indicates changing function. It is therefore expected to correlate positively with
events with high Ka/Ks ratios. Because Ka/Ks ratios use a silent substitution clock that ticks rapidly, while
covarion analysis does not, the two are somewhat complementary.

C. Detecting high absolute rates of amino acid substitution, changes per unit time. 

An alternative way to detect changes in function is to measure the number of amino acid substitutions
that occur per unit time. This requires that dates be assigned to nodes in an evolutionary tree. This can be 
done by correlation with the paleontological record, as is well known in the art. 

II. Tools that detect conservation of function within a family of proteins.
A. Compensatory changes 

The conservation of the overall fold after extensive divergences raises the possibility that amino acid 
substitutions at one position in a polypeptide chain might be compensated by substitutions elsewhere in a 
protein. For example, if a Gly at one position inside the folded protein core is replaced by a Trp, it might 
be necessary to substitute a Trp by a Gly at a position distant in the sequence but near in space to 
conserve the overall volume of the core, and therefore the overall folded structure. Such proposals assume
that if a substitution is not compensated, the organism hosting the protein is less fit.

Individual examples of compensatory changes in proteins have been proposed [Oos86], both by
analysis of families of natural proteins with known structures
[Les80][Les82][Cho82][Alt87][Alts88][Por90] and in proteins into which point mutations have been
introduced by site-directed mutagenesis [Lim89][Lim92][Bal93]. In these examples, amino acid residues
distant in the sequence but near in three dimensional space in the folded structure have been observed to
undergo simultaneous compensatory variation to conserve overall volume, charge, or hydrophobicity.

Compensatory covariation has been used in the prediction of tertiary folds. For protein kinase
[Ben91], for example, an antiparallel beta sheet was predicted for the core of the first domain because of
two specific compensatory changes identified in consecutive strands in the predicted secondary structural
model. The subsequently determined crystal structure [Kni91] showed not only that the antiparallel beta
sheet existed, but that the side chains of the two residues undergoing compensatory covariation were
indeed in contact.






Systematic studies have suggested, however, that compensatory covariation generates only a small
signal. The early work by Lesk and Chothia with the globin family found that replacements of
hydrophobic residues in the core of the protein fold are usually accommodated by small shifts of
secondary structural elements rather than by size complementary amino acid substitutions
[Les80][Les82][Cho82]. More recent studies have suggested that a weak compensatory covariation signal
might exist [Tay94][Shi94][Gob94][Neh94]. Some authors have doubted, however, that the signal is
adequate to be useful in structure prediction [Tay94]. Others have been more optimistic [Neh94][Shi94].
More recently, Chelvanayagam et al. pointed out that the signal might be improved if examples of
compensatory covariation were sought within an explicit evolutionary context [Che97][Che98].

In the literature, compensatory changes have been sought by comparing the sequences of two extant
proteins from contemporary organisms. In principle, any position where an amino acid residue had 
undergone substitution at any point in the time separating the two proteins via the common ancestor might 
be paired with any other position that had also suffered substitution in this time. Such an approach is 
problematic because the evolutionary time separating two contemporary protein sequences can be long; in 
years, it is twice the time since the most recent common ancestor of the two proteins. 

A different way to detect compensatory covariation begins with the recognition that a model for the 
historical past in a protein family can be inferred from a set of homologous protein sequences. These
models have three parts: (a) an evolutionary tree, which shows the genealogical relationships between 
individual proteins in the family, (b) a multiple sequence alignment, which shows the evolutionary 
relationship between individual nucleotides in the genes encoding each family, and (c) reconstructed 
sequences of ancestral proteins that are evolutionary intermediates in the tree. Through the reconstruction 
of ancestral sequences, specific changes in a protein sequence can be assigned to (and isolated to) specific 
branches of the evolutionary tree. Within the context of a reconstructed model for the historical past, 
compensatory covariation should appear as two substitutions occurring on the same branch of the 
evolutionary tree. As these branches can be rather short in length, an analysis based on a reconstructed 
history of a protein family can identify changes that occur nearly simultaneously. These are expected to be 
true indicators of compensation. In principle, a weak compensatory covariation signal observed by the 
comparison of extant sequences should be strengthened by examining individual episodes in divergent 
evolution as reflected by specific branches in the evolutionary tree. 
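
The branch-by-branch search described above can be sketched as follows. The sketch assumes that the
reconstructed model has already been reduced to a list of (ancestor, descendant) aligned sequence pairs,
one pair per branch; the data layout and function names are illustrative assumptions, not the
implementation used in the Master Catalog.

```python
from collections import Counter
from itertools import combinations

def changed_positions(ancestor, descendant):
    """Alignment columns where the residue differs across one branch,
    ignoring columns with a gap ('-') at either end of the branch."""
    return [
        i
        for i, (a, d) in enumerate(zip(ancestor, descendant))
        if a != d and a != "-" and d != "-"
    ]

def covarying_pairs(branches):
    """Count, for every pair of positions, the number of branches on
    which both positions change together."""
    counts = Counter()
    for ancestor, descendant in branches:
        for pair in combinations(changed_positions(ancestor, descendant), 2):
            counts[pair] += 1
    return counts

# Toy tree with three branches; positions 1 and 2 swap G and W on the first.
branches = [("AGWK", "AWGK"), ("AWGK", "AWGR"), ("AGWK", "SGWK")]
pair_counts = covarying_pairs(branches)
```

Pairs that recur on several short branches would be the stronger candidates for compensation; a pair
seen once on a long branch is weak evidence at best.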

In preliminary studies, we examined 71 families of proteins from the Master Catalog to learn whether
reconstructed ancestral sequences will generate a more useful signal for compensatory covariation than
can be obtained by examining extant sequences. We noticed anecdotally that covariation was more likely 
to occur along branches with low Ka/Ks values. This makes sense, as compensation is necessary only if 
function is conserved. Case studies developed under this project will test this. 

B. Homoplasy

One feature commonly observed in divergent evolution but not modelled well by even advanced
stochastic models is molecular homoplasy, defined as a character similarity that arose independently in
different subfamilies of an evolutionary tree [SdOO].

Molecular homoplasy is best illustrated by an example (Drawing 3). Homoplasy so defined is the 
observed phenomenon; no statement is made as to the mechanism by which homoplasy arises. It may 
reflect selection pressures. The Master Catalog gives us the opportunity to systematically search for 
molecular homoplasy in the database as a whole. 

At one level, homoplasy is simply the statement that selective pressures are forcing the protein to select 
from a subset of the 20 standard amino acids. Thus, it is similar to the bias that is seen in membrane 
proteins, for example (where residues are chosen more frequently from a subset of hydrophobic amino 
acids than in the database as a whole). Homoplasy is more than this. Not only (in the example) is position 30
limited to A and P, but the selection pressures have toggled between the two more than once in the 
module's evolutionary history. 
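
A search for this kind of toggling can be sketched in a few lines. The sketch assumes that each branch of
the reconstructed tree is available as a named (ancestor, descendant) pair of aligned sequences, and it
treats a residue as homoplastic at a position when it arises as the derived state on more than one
branch; both the data layout and the function name are hypothetical.

```python
from collections import defaultdict

def find_homoplasies(branches, position):
    """Map each derived residue at `position` to the branches on which it
    arose, keeping only residues that arose on more than one branch."""
    origins = defaultdict(list)
    for name, (ancestor, descendant) in branches.items():
        before, after = ancestor[position], descendant[position]
        if before != after:
            origins[after].append(name)
    return {res: names for res, names in origins.items() if len(names) > 1}

# Toy example: a single position toggling between A and P.
branches = {
    "b1": ("A", "P"),
    "b2": ("P", "A"),
    "b3": ("A", "P"),
    "b4": ("P", "P"),
}
homoplasies = find_homoplasies(branches, position=0)
```

Whether two branches count as belonging to "different subfamilies" depends on the tree topology, which
this sketch deliberately leaves out.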

This is, of course, a signature that a functional constraint is conserved in the distant branches of the
protein's evolutionary tree. For this reason, molecular homoplasy is expected to be a contrarian signature to high Ka/Ks
or non-stationary covarion behavior in a protein. We expect it to occur more frequently with proteins that 
are not undergoing functional recruitment. 

Some informative features are already evident from preliminary work. For example, a preliminary
search of 38 protein families with high resolution crystal structures identified over 2000 examples of
molecular homoplasy. These were characterized first by the nature of the amino acids identified. A
number of very obvious patterns emerged. First, the majority of the examples involve the interchange of
hydrophobic side chains of nearly identical volume. The homoplasy involving I and V was the most
frequent. It occurred 230 times in the dataset. The I/V molecular homoplasy was far more abundant than
the next most popular hydrophobic/hydrophobic homoplasy, F/Y, which was found 68 times, and the I/L
hydrophobic/hydrophobic homoplasy, which was found 44 times. As might be expected, the majority of
these were buried in the three dimensional structure of the protein.

In the next phase of work we will ask whether these homoplasies are correlated with homoplasies at 
other positions in the same sequence in the same branches of the trees. If the functional constraint at an
amino acid position is sufficient to permit a protein to confer fitness only if it places one of two residues
there, then this constraint might be sufficient to cause compensation, also possibly homoplastic, at a 
second position nearby in the folded structure of the protein. Further, it is necessary to characterize the 
branch length (NED or PAM) where the changes occur. 

The most interesting homoplasies are those that involve multiple steps. For example, the Pro/Gly
homoplasy (at the codon level, CCN to GGN) requires two substitutions. Either of these alone creates a
change in the encoded amino acid (CGN, Arg, or GCN, Ala). Observing examples of these without
observing the intermediates anywhere else in the tree suggests that selection pressure is remarkably strong
at this position, even though two amino acids appear to be nearly equally suited to perform function.
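
The codon arithmetic in this example can be checked with a short sketch. It builds the standard genetic
code from the conventional TCAG ordering and lists the single-substitution intermediates between two
codons that differ at exactly two positions; the function name is an illustrative assumption.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def single_step_intermediates(start, end):
    """For two codons differing at exactly two positions, return the two
    one-substitution intermediates and the amino acids they encode."""
    diffs = [i for i in range(3) if start[i] != end[i]]
    if len(diffs) != 2:
        raise ValueError("codons must differ at exactly two positions")
    result = []
    for i in diffs:
        codon = start[:i] + end[i] + start[i + 1:]
        result.append((codon, CODON_TABLE[codon]))
    return result

# Pro (CCA) to Gly (GGA): the intermediates encode Ala and Arg.
steps = single_step_intermediates("CCA", "GGA")
```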

Molecular homoplasy indicates a constraint on structure that implies a constant behavior, which in turn
implies a constant function. If this is true, it should correlate negatively with Ka/Ks ratios. That is,
homoplasy should be found less frequently in branches separated by a branch with a high Ka/Ks ratio
than in branches not separated by such a branch. Case studies developed under this project will develop
ways to exploit such a correlation.

C. Absolute conservation within a defined evolutionary distance 

As disclosed in Serial No. 07/857,224, residues that are conserved over an entire evolutionary tree are 
presumed (at the level of hypothesis) to be important for function, especially if they are chosen from the 
group consisting of Asp, Lys, Arg, Glu, Asn, Cys, His, Gln, Ser, and Thr. As disclosed in that application,
however, it is important that the overall PAM width of the tree be considered before constructing
hypotheses about the functional role of conserved residues.

III. Tools that identify individual residues involved in changes in functionally significant 
behavior. 

In Serial No. 08/914,375, it was disclosed that during episodes of rapid sequence evolution, amino 
acid substitutions will be concentrated in secondary structural elements. These are secondary structural 
elements that are important in the acquisition of new function. These elements might be predicted using 
the method claimed in Serial No. 07/857,224; they might also be known by X-ray crystallography or 
n.m.r., for example. As in Serial No. 08/914,375, a general method for identifying secondary structural
elements that contribute to the origin of new biological function comprises identifying an element in
the predicted secondary structure model where the corresponding section of the gene has a high ratio of 
expressed to silent changes. 

In this analysis, we must recognize that function involves combinations of behaviors of a protein.
Even when function changes, some features of those behaviors are conserved, and this reflects
conservation of some features of the sequence as well. In the fumarase/aspartase/adenylosuccinate lyase
example discussed above, all three proteins have the same overall fold. For this reason, residues critical to
the folding process (for example, amino acids whose side chains pack tightly into the folded core) will
remain conserved even though the overall function of the protein is changing. Relevant to the change in
function is, of course, a change in a number of behaviors, for example, the ability to bind a particular small
molecule substrate. Residues involved in substrate binding will therefore be changing rapidly during the
episode of sequence evolution where function was changing.

The notion that some residues are conserved even when function is changing is matched by the notion
that some residues will be changing even when function is conserved. The latter are those that can drift
"neutrally".

Likewise, "function" remains a concept set within Darwinian evolution. That is, a fumarase from a 
mesophile and a fumarase from a thermophile have analogous function in the sense that they both 
participate (for example) in the citric acid cycle. However, they have different functions, in that one 
contributes to fitness in a thermophile (which requires that it have an associated behavior, thermostability) 
while the other does not. In the episode where the temperature of the environment changes, residues
involved in conferring thermal stability will change, while those involved in determining substrate
specificity will not. 

Tools that assign, even at the level of hypothesis, which residues are involved in which behavior are 
extremely valuable. They can be the targets of protein engineering experiments, for example. In these
cases, one would like to map residues identified using tools of the instant invention on to a three 
dimensional structure of a representative member of a protein family. 

Already in 1988, the Applicant was using a general form of mapping that showed the utility of this in 
extracting information about the function of a protein, in this case, alcohol dehydrogenase [Ben88 xxx]. 
More recently, Lichtarge et al. introduced an evolutionary trace method that defined functionally 
significant residues as those that are conserved within a family [O. Lichtarge, H. R. Bourne, F. E. Cohen, 
An evolutionary trace analysis defines binding surfaces common to protein families. J. Mol. Biol. 257,
342-358 (1996).]. They then used this approach to identify patches on the surface of proteins that
contribute to functionality.

As it was published, the evolutionary trace method was related to the method disclosed in Serial No.
07/857,224, and was applied to conserved amino acid residues. The approach did not contemplate the
possibility that function might change within a family of proteins, and that the residues important for
function would change with it. Indeed, to be broadly useful, detecting such changes requires the tools
disclosed in this application and in Serial No. 08/914,375.

A. Residues changing in episodes with high Ka/Ks values, minus residues changing in 
episodes with low Ka/Ks values 

We have posited that function is changing during an episode with high Ka/Ks values. As disclosed in
Serial No. 08/914,375, individual residues can be identified as changing during that episode, as the basic 
evolutionary model has sequences reconstructed at each individual node. These are, at the level of 
hypothesis, residues that are important to functional change. 

As one of ordinary skill in the art recognizes, the episode also includes a number of substitutions that
have no relevance to function or the change in function, but rather reflect the background, neutral drift. For
example, these residues might lie on the surface of the protein, be in contact with bulk solvent, and not
have any especially strong functional constraint that prevents them from diverging. As disclosed in Serial
No. 07/857,224, surface residues are likely to be neutrally drifting in many sub-families within an
evolutionary tree. For this reason, we can identify residues that are changing along branches of an
evolutionary tree that have low Ka/Ks values, and subtract them from residues changing in episodes with
high Ka/Ks values. What remains are residues more likely, again at the level of hypothesis, to be involved
in the change in function.
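
The subtraction described above is a set difference over alignment positions. A minimal sketch follows,
assuming that branches have already been sorted into high and low Ka/Ks groups and that each branch is
an (ancestor, descendant) pair of aligned reconstructed sequences; the names are hypothetical.

```python
def changed_sites(branches):
    """Set of alignment positions that change on at least one branch."""
    sites = set()
    for ancestor, descendant in branches:
        sites.update(
            i for i, (a, d) in enumerate(zip(ancestor, descendant)) if a != d
        )
    return sites

def candidate_functional_sites(high_ka_ks_branches, low_ka_ks_branches):
    """Positions changing in high Ka/Ks episodes, minus positions that
    also change on low Ka/Ks branches (presumed neutral drift)."""
    return changed_sites(high_ka_ks_branches) - changed_sites(low_ka_ks_branches)

# Toy data: positions 1 and 3 change in the high Ka/Ks episode, but
# position 3 also drifts on a low Ka/Ks branch.
high = [("ACDEF", "AQDKF")]
low = [("ACDEF", "ACDRF")]
candidates = candidate_functional_sites(high, low)
```

The threshold separating "high" from "low" Ka/Ks is left to the caller, since the text does not fix one.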

Serial No. 07/857,224 disclosed and claimed methods for correlating changes in sequence with 
changes in the behavior of the protein. This in turn provides a method for identifying behavioral changes 
that are relevant to the change in function.

B. Residues displaying covarion behavior 

Again because the basic evolutionary model includes reconstructed ancestral intermediates, the 
methods of the instant invention identify specific residues that are displaying covarion behavior. These are 
residues that are under analogous functional constraints in different sub-families of the tree. This, in turn,
implies that these particular residues contribute to a behavior that is conserved for a conserved feature of
the function in distant branches of the tree. 

C. Mapping these residues on to models for the secondary, tertiary, and quaternary structure of 
proteins. 

Insight into the relationship between function and amino acid sequence can be gained by mapping 
residues identified by Ka/Ks and covarion analysis onto a three dimensional structure. This identifies, for
any particular branch, which residues are involved in changing function. This information is useful when 
attempting to identify residues that might be changed in a protein engineering experiment, for example. 

IV. Tools that identify individual residues involved in conservation of functionally significant
behavior

The type of analysis used for class III tools can also be applied to class IV tools.

A. Residues suffering compensatory changes 

When a pair of residues suffers compensatory changes during a particular episode of protein 
sequence evolution, this implies that some physical property of the protein family must be the same at the 
end of the episode as it was at the beginning. This implies some conserved behavior important across that 
episode. The episode can, of course, be one where function in some sense is changing. Thus, in the 
fumarase/aspartase example outlined above, one might identify residues that suffer compensatory changes
during episodes where catalytic behavior is changing. These are residues most likely (at the level of
hypothesis) to be important for folding, which is conserved over this episode. We can therefore use the
methods of the instant invention to identify individual residues involved in conservation of functionally
significant behavior.

B. Residues displaying homoplasy

Positions that display homoplasy are subject to analogous functional constraints in different branches
of the tree. Because of the evolutionary reconstructions in the basic evolutionary model, we know which
positions they are and which amino acids are involved. Therefore, we use the methods of the instant
invention to identify individual residues involved in conservation of functionally significant behavior.

C. Mapping these residues on to models for the secondary, tertiary, and quaternary structure of
proteins.

Insight into the relationship between function and amino acid sequence can be gained by mapping 
residues identified by Ka/Ks and covarion analysis onto a three dimensional structure. This identifies, for 
any particular branch, which residues are involved in changing function. This information is useful when 
attempting to identify residues that might be changed in a protein engineering experiment, for example. 

V. Tools that involve correlation between the evolutionary histories of two families of proteins 

Serial No. 07/857,224 introduced in the first useful form the notion of compensatory changes as a 
way of analyzing divergent evolution in protein sequences. In that application, an example of
compensatory covariation was identified that indicated the packing of two beta strands in an antiparallel 
fashion. A second use for compensatory changes disclosed was as part of a tool to detect disulfide bonds 
in a protein; cysteines that arise and/or disappear at the same time during the divergent evolution of a 
protein family frequently form a disulfide bond with each other. Serial No. 08/914,375 extended this 
notion, noting that the introduction and loss of leptin and the leptin receptor might occur in parallel. The 
idea behind this analysis is that residues that interact as they contribute to function, subunits that interact
as they contribute to function, and even proteins that interact as they contribute to function, display 
correlated evolution. 

Since these applications were filed, various other groups have extended this approach. We review
briefly two of the areas where research is active, and make comments on why additional invention is
necessary to make these approaches fully useful.

A. Correlating the topology of evolutionary trees in two families of proteins 

Recently, Pellegrini et al. extended this type of analysis to generate "protein phylogenetic profiles" for
different organisms [Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D., Yeates, T. O.
Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. PNAS 96,
4285-4288 (1999)]. They presented a method that assumes that during evolution, proteins that function
together tend to be either preserved or eliminated together in a new species. They described this property of
correlated evolution by characterizing each protein by its phylogenetic profile, a string that encodes the
presence or absence of a protein in every known genome. They suggested that proteins having matching
or similar profiles strongly tend to be functionally linked. This method of phylogenetic profiling allows us
to predict the function of uncharacterized proteins.

More recently, Cohen and his coworkers used phosphoglycerate kinase (PGK), an enzyme that forms
its active site between its two domains, to develop a standard for measuring the co-evolution of interacting
proteins. The N-terminal and C-terminal domains of PGK form the active site at their interface and are
covalently linked. Therefore, they must have co-evolved to preserve enzyme function. By building two
phylogenetic trees from multiple sequence alignments of each of the two domains of PGK, they calculated
a correlation coefficient for the two trees that quantifies the co-evolution of the two domains. The
correlation coefficient for the trees of the two domains of PGK is 0.79, which establishes an upper bound
for the co-evolution of a protein domain with its binding partner. Their analysis was extended to ligands
and their receptors, using the chemokines as a model [Goh, C.-S., Bogan, A. A., Joachimiak, M., Walther,
D., Cohen, F. E. (2000) Co-evolution of Proteins with their Interaction Partners. J. Mol. Biol. xxx, 283-
293].
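
One way to obtain a correlation coefficient for two trees is sketched below. It assumes that each tree has
been summarized as a table of pairwise distances keyed by sorted species pairs, and that the coefficient
is the Pearson correlation over the pairs shared by both tables. This is an illustrative assumption; the
exact procedure used by Cohen and coworkers is not reproduced here.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists.
    Undefined (division by zero) if either list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def tree_correlation(dist_a, dist_b):
    """Correlate two distance tables over the species pairs they share.
    Keys are (species1, species2) tuples with the names in sorted order."""
    shared = sorted(set(dist_a) & set(dist_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared species pairs")
    return pearson([dist_a[p] for p in shared], [dist_b[p] for p in shared])

# Toy distance tables for two domains sampled from the same three species.
domain_n = {("a", "b"): 1.0, ("a", "c"): 2.0, ("b", "c"): 3.0}
domain_c = {("a", "b"): 2.0, ("a", "c"): 4.0, ("b", "c"): 6.0}
r = tree_correlation(domain_n, domain_c)
```

A value near 1 means the two families have diverged in step across the sampled species; as the next
paragraph notes, paralogs and gene loss can distort such a comparison.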

We have no quarrel with either of these approaches; indeed, they are in some ways covered by the 
Applicant's earlier disclosures. It should be recognized, however, that these simple approaches that exploit
evolutionary analysis are easily defeated by the "ortholog paralog problem", especially when it is coupled 
with gene loss. Briefly, paralogs are generated when a gene duplication occurs internally within a genome, 
to create two homologous genes in the same organism. 

B. Correlating the connectivity of proteins in a gene family 

Eisenberg and his coworkers, Enright [AJ] et al., and others have also suggested that proteins that
interact in a pathway might be connected physically in the genome, either as an operon or, in some cases, 
in a single expressed polypeptide chain. This interesting approach is applicable to only a subset of the 
database, and is distinct from the tools disclosed here. [Marcotte, E.M., M. Pellegrini, H.L. Ng, D.W. 
Rice, T.O. Yeates, and D. Eisenberg. 1999. Detecting protein function and protein-protein interactions 
from genome sequences. Science. 285: 751-753] 

C. Dating events in the molecular history 

A key element in using evolutionary analysis of correlated change in protein families is to establish that
the changes being interpreted as evidence that two proteins interact as they function are
contemporaneous, that is, that they occur near the same time. This requires tools that date, if only
approximately, events in the molecular evolutionary tree using sequence data.

Early hope that protein sequences might change in a "clock-like" fashion [Can82], with a small number 
of rate constants describing the rate of change at most positions in most proteins in most organisms, has 
given way to the reality that the evolution of protein sequences is marked by episodes of rapid and slow 
evolution [Mes97]. These correspond to changing and conserved function within the protein family, 
arising in turn from adaptive and purifying natural selection, respectively. This makes methods based on 
protein sequence divergence unreliable for dating the divergence of protein sequences. 

One well known approach to avoid (to a large extent, at least in metazoans) the influence of purifying
and adaptive selection on the interpretation of molecular history is to examine changes in non-coding
regions of DNA [Li97]. These include introns and substitutions, generally at the third position of a codon,
that do not change the encoded amino acid. These arise because the genetic code is redundant for many
amino acids. This approach assumes that silent substitutions at the DNA level have little or no impact on
fitness (are neutral or nearly neutral) at the level of the organism. While this is almost certainly not a good
approximation in microorganisms, the approximation appears to be serviceable for metazoans
(multicellular animals) and plants, presumably because macrophysiology is more visible to selective forces
than genome sequence itself in multicellular organisms.

Even silent substitutions are problematic as a molecular clock, however. From a chemical perspective,
interconverting the four standard nucleobases A, G, T, and C involves 12 rate constants that need not be
identical [Nei86]. Some models distinguish between transitions (purines replaced by purines, or
pyrimidines replaced by pyrimidines) and transversions (purines replaced by pyrimidines, or pyrimidines
replaced by purines), but otherwise group the rate processes together. This problem is revisited frequently
in the literature [Nei86]. The most widely used method was developed by Li [Li85] with modifications by
Pamilo and Bianchi [Pam93]. This method aggregates four fold redundant and two fold redundant sites,
analyzes nucleotide substitution at positions where the encoded amino acid has not changed at the same
time as it analyzes substitution at positions where the encoded amino acid has changed, and adopts a
classification of different types of substitutions based on physical chemical characteristics of amino acids.
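
The distinction between transitions and transversions, and the count of 12 possible interconversions,
can be made concrete with a short sketch; the function name is illustrative.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(before, after):
    """Classify a single-base change as a transition or a transversion."""
    if before == after:
        return "identical"
    pair = {before, after}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"

# The 12 ordered interconversions of A, G, C, and T, each of which could
# in principle have its own rate constant.
all_changes = list(permutations("AGCT", 2))
n_transitions = sum(
    classify_substitution(a, b) == "transition" for a, b in all_changes
)
n_transversions = len(all_changes) - n_transitions
```

Of the 12 ordered changes, 4 are transitions and 8 are transversions, which is why grouping them into
two classes is already a substantial simplification.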

Disclosed here for the first time, the Applicant has discovered that a good part of the inconsistency in the
dating generated by these methods can be eliminated if one focuses on relatively homogeneous chemical
processes. In particular, transitions accumulate over large periods of (for example) vertebrate history with
remarkable constancy, with a pseudo first order rate constant of 3.0 x 10^-9 changes/base/year. A tool
based on this discovery begins by extracting aligned pairs of codons from a pairwise alignment where two
fold redundant amino acids (CDEFHKNQY) are conserved. Substitution at the silent position is then
modelled using an exponential "approach to equilibrium" rate law, where f2 is the fraction of the codons
encoding conserved 2FR amino acids that are themselves conserved: f2 = [0.5 exp(-kt)] + 0.5, where k is
a single pseudo first order rate constant for transitions, and t is the time. The neutral evolutionary distance
(NED) between two genes x and y is defined by NED_{x,y} = k t_{x,y} = -ln[(f2_{x,y} - 0.5)/0.5].
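
The calculation just defined can be sketched as follows. The sketch assumes that the pairwise alignment
has already been reduced to a list of aligned codon pairs and uses the standard genetic code; the
function name and the error handling are illustrative assumptions rather than the Applicant's
implementation.

```python
from math import log

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
TWO_FOLD = set("CDEFHKNQY")  # two fold redundant amino acids named in the text

def neutral_evolutionary_distance(codon_pairs):
    """Return (f2, NED) for a list of aligned codon pairs.
    Only pairs encoding the same two fold redundant amino acid are used."""
    usable = [
        (c1, c2)
        for c1, c2 in codon_pairs
        if CODON_TABLE[c1] == CODON_TABLE[c2] and CODON_TABLE[c1] in TWO_FOLD
    ]
    if not usable:
        raise ValueError("no conserved two fold redundant codon pairs")
    f2 = sum(c1 == c2 for c1, c2 in usable) / len(usable)
    if f2 <= 0.5:
        raise ValueError("f2 at or below 0.5: silent sites are saturated")
    return f2, -log((f2 - 0.5) / 0.5)

# Toy alignment: four usable pairs (three identical), one four fold
# redundant pair (Ala) and one pair whose amino acid changed (Asp/Glu).
pairs = [("GAT", "GAT"), ("GAT", "GAC"), ("AAA", "AAA"),
         ("TTT", "TTT"), ("GCT", "GCC"), ("GAT", "GAA")]
f2, ned = neutral_evolutionary_distance(pairs)
```

With f2 = 0.75 the distance is ln 2, about 0.69. As f2 approaches 0.5 the logarithm diverges, so the
measure loses resolution for very old divergences.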

NEDs represent one choice in a trade-off, between the instinct of a statistician (to maximize the number 
of characters being examined, and hence minimize error due to fluctuation) and the instinct of an organic 
chemist (to seek homogeneous rate processes, and hence minimize systematic error due to aggregation of 
different kinds of events). 

The NED is a measure of evolutionary distance, not evolutionary time. If one knows the rate constant,
and assumes that k is constant over the period of evolutionary history being examined, one can calculate
the time of divergence. Given the same assumption and the date of evolutionary divergence of two
sequences, one can calculate k. As distances, NEDs are additive, should obey the triangle inequality, and
display other features that permit them to be used to build evolutionary trees.

The transition-based two fold NEDs turned out to be remarkably robust measures of evolutionary time.
When calibrated using datable fossil divergences back to the divergence of fish from land vertebrates, a
single lineage rate constant of 3 x 10^-9 changes per base per year was obtained in many of the cases we
examined, applicable (within error) to the divergence of fish from mammals, reptiles and birds from
mammals, primates from artiodactyls, and artiodactyl genera from other artiodactyl genera. NEDs built
from four fold redundant systems were far less consistent.
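
The two conversions between NED, rate constant, and time can be written out directly from NED = kt. In
this sketch t is used exactly as it appears in that relation; whether it denotes the time since the common
ancestor or the total time separating the two sequences depends on the calibration convention, which
the sketch does not resolve. The function names are hypothetical.

```python
K_TRANSITION = 3.0e-9  # changes per base per year, the value quoted in the text

def time_from_ned(ned, k=K_TRANSITION):
    """Time t (years) implied by NED = k * t, assuming k is constant."""
    return ned / k

def rate_from_ned(ned, t_years):
    """Rate constant k implied by NED = k * t for a dated divergence."""
    return ned / t_years

t = time_from_ned(0.3)         # about 1e8 years under the stated assumption
k = rate_from_ned(0.3, 1.0e8)  # recovers about 3e-9 changes/base/year
```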

One of the key issues in the development of evolutionary models is assigning ranges of geological 
dates to nodes in the tree. Early hope that protein sequences might change in a "clock-like" fashion, with a
small number of rate constants describing the rate of most amino acid substitutions in most proteins in 
most organisms, has given way to the reality that the evolution of protein sequences is marked by episodes 
of rapid and slow evolution. These correspond to changing and conserved function within the protein 
family, arising from adaptive and purifying natural selection respectively. This makes protein sequence 
similarity (for example, point accepted mutations per 100 amino acids, or PAM units) unreliable for dating 
the divergence of protein sequences. 

One well known approach to avoid the influence of purifying and adaptive selection on the
interpretation of molecular history is to examine changes in non-coding regions of DNA. These include
introns and substitutions, generally at the third position of a codon, that do not change the encoded amino
acid. These arise because the genetic code is redundant for many amino acids. Amino acids encoded by
four synonymous codons (A4's) are valine, alanine, threonine, proline and glycine. Amino acids encoded
by two synonymous codons (A2's) are cysteine, aspartic acid, glutamic acid, phenylalanine, histidine,
lysine, asparagine, glutamine, and tyrosine. One amino acid (isoleucine) is encoded by three synonymous
codons (A3's). These patterns are found in the eukaryotic nuclear code; other codes exist, of course.
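
These redundancy classes can be derived mechanically from the standard genetic code, as the following
sketch shows; the function name is illustrative.

```python
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def degeneracy_classes():
    """Group amino acids by the number of codons that encode them
    in the standard genetic code (stop codons excluded)."""
    counts = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
    classes = {}
    for aa, n in counts.items():
        classes.setdefault(n, set()).add(aa)
    return classes

classes = degeneracy_classes()
a4 = "".join(sorted(classes[4]))  # 'AGPTV'
a2 = "".join(sorted(classes[2]))  # 'CDEFHKNQY'
a3 = "".join(sorted(classes[3]))  # 'I'
```

The remaining amino acids fall outside these three classes: leucine, serine, and arginine have six
codons each, and methionine and tryptophan have one.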

This approach has a chance of working if silent substitutions at the DNA level have little or no impact
on fitness at the level of the organism. While this is almost certainly not a good approximation in
microorganisms (at least for some codons in highly expressed genes), the approximation appears to be
serviceable for metazoans (multicellular animals), presumably because redundant codon exchange does
not change the structure or the behavior of any functioning protein, and the structure and behavior of
functioning proteins, together with the consequent macrophysiology, is more visible to selective forces
than genome sequence itself. The approach is now empirically shown to be reliable within chordates.

Even silent substitutions are problematic as a molecular clock, however. From a chemical perspective, 
interconversion of the four standard nucleobases A, G, C, and T involves 12 rate constants that need not 
be identical (there is a large literature on this; see for example [Nei86]). Simpler models have distinguished 
between transitions (purines replaced by purines, or pyrimidines replaced by pyrimidines) and 
transversions (purines replaced by pyrimidines, or pyrimidines replaced by purines), but otherwise 
grouped the rate processes together. 
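
The count of 12 rate constants and the transition/transversion grouping can be illustrated as follows. This is a minimal sketch; the function name classify_substitution is an illustrative assumption.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(src, dst):
    """Label a nucleotide substitution as a transition or a transversion."""
    if src == dst:
        raise ValueError("not a substitution")
    pair = {src, dst}
    # Transition: both bases are purines, or both are pyrimidines.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


# Every ordered pair of distinct bases is a separate rate process.
SUBSTITUTIONS = list(permutations("ACGT", 2))

if __name__ == "__main__":
    for src, dst in SUBSTITUTIONS:
        print(src, "->", dst, classify_substitution(src, dst))
```

Of the 12 ordered substitutions, 4 are transitions and 8 are transversions.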
This problem has been revisited frequently in the literature. The most widely used method (indeed, the 
one implemented in the present version of the Master Catalog when assigning Ka/Ks values, following 
some adaptations that we made; Schreiber, Benner, unpublished) was developed by Li [Li85] with 
modifications by Pamilo and Bianchi [Pam93] following a suggestion by Kimura. 

In the previous funding period, we developed and tested NEDs as a tool for dating sequence 
divergences (Table 1). NEDs turned out to be remarkably robust measures of evolutionary time. When 
calibrated using datable fossil divergences back to the divergence of fish from land vertebrates, a single 
lineage rate constant of 3 x 10^-9 changes per base per year was obtained in many of the cases we 
examined, applicable (within error) to the divergence of fish from mammals, reptiles and birds from 
mammals, primates from artiodactyls, and artiodactyl genera from other artiodactyl genera. Statistical 
analysis suggests that >80% of the variance arises from simple statistical fluctuation. This suggests the 
absence of "hot spots" and other non-stochastic variation at the 2-fold degenerate sites in the genome. 
Again, relatively extensive tools (such as full blown ML tools) gave insignificantly different results than 
relatively cheap tools (such as the Pamilo-Bianchi approach) in a series of test cases that were applied in 
parallel. 



Table 1. Average NED values for Pairs of Proteins Extracted from Humans, Pigs, Oxen, Rabbit, Rat, and Mouse 

Species 1   Species 2   Number     kt (range)   Date       k (calc.) x 10^9      k (average) x 10^9 
                        of pairs   (NED)        (fossil)   changes/base/year     changes/base/year 
                                                MYA 

Human       Pig         225        0.3990       80         2.5 
Human       Ox          410        0.3800       80         2.4                   2.4 
Pig         Ox          140        0.2755       60         2.3 

Rabbit      Human       203        0.4845       80         3.0 
Rat         Human       584        0.4893       80         3.0                   3.1 
Mouse       Ox          147        0.5130       80         3.2 
Mouse       Human       918        0.4988       80         3.1 

Mouse       Rabbit      87         0.5083       60         4.2                   5.2 
Mouse       Rat         926        0.2470       20         6.2 
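
The k (calc.) column can be reproduced, to within rounding, from the NED and fossil date columns. The sketch below assumes that the pairwise NED accumulates along both lineages, so that the single lineage rate constant is k = kt / (2 x date); this relation is inferred from the tabulated values rather than stated explicitly, and the names TABLE_ROWS and lineage_rate are illustrative.

```python
# (species 1, species 2, kt (NED), fossil date in MYA, tabulated k x 10^9)
TABLE_ROWS = [
    ("Human", "Pig", 0.3990, 80, 2.5),
    ("Human", "Ox", 0.3800, 80, 2.4),
    ("Pig", "Ox", 0.2755, 60, 2.3),
    ("Rabbit", "Human", 0.4845, 80, 3.0),
    ("Rat", "Human", 0.4893, 80, 3.0),
    ("Mouse", "Ox", 0.5130, 80, 3.2),
    ("Mouse", "Human", 0.4988, 80, 3.1),
    ("Mouse", "Rabbit", 0.5083, 60, 4.2),
    ("Mouse", "Rat", 0.2470, 20, 6.2),
]


def lineage_rate(kt, date_mya):
    """Single lineage rate constant (changes/base/year), assuming k = kt / (2 * t)."""
    return kt / (2.0 * date_mya * 1e6)


if __name__ == "__main__":
    for sp1, sp2, kt, date, tabulated in TABLE_ROWS:
        k = lineage_rate(kt, date) * 1e9
        print(f"{sp1}-{sp2}: calculated {k:.2f}, tabulated {tabulated}")
```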



D. Correlating evolutionary events in two protein families occurring at approximately the same 
time 

Given approximate dates, we can now provide a more useful tool to correlate events occurring in two 
trees. A duplication in family 1 that occurred near the same time as a duplication in family 2 is 
hypothesized to indicate that the two families (and, in particular, the proteins arising from the duplication) 
interact when they function. Conversely, and frequently quite usefully, a duplication in family 1 that did 
not occur near the same time as a duplication in family 2 is hypothesized to indicate that the 
proteins arising from the duplications do not interact when they function. These hypotheses are useful when 
designing two-hybrid systems, for example, to detect protein-protein contacts. 
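
The correlation described in this section can be sketched as a comparison of dated duplication events in two trees. The event names, the dates, and the 5 million year tolerance below are hypothetical values chosen only for illustration; no particular tolerance is specified in this disclosure.

```python
def contemporaneous_pairs(dups_family_1, dups_family_2, tolerance_my=5.0):
    """Pair duplications from two families whose dates (in mya) agree within a tolerance.

    Each argument maps a duplication label to its estimated date in mya.
    Returned pairs are candidate interacting families; unpaired events
    are candidates for non-interaction.
    """
    return [
        (name_1, name_2)
        for name_1, date_1 in dups_family_1.items()
        for name_2, date_2 in dups_family_2.items()
        if abs(date_1 - date_2) <= tolerance_my
    ]


if __name__ == "__main__":
    # Hypothetical duplication dates (mya) for two protein families.
    family_1 = {"dup_1a": 33.0, "dup_1b": 120.0}
    family_2 = {"dup_2a": 31.0, "dup_2b": 400.0}
    print(contemporaneous_pairs(family_1, family_2))
```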
E. Correlating evolutionary events in two protein families that are associated with analogous 
behavior involving expressed/silent ratios 

When there is a duplication, the question arises: Which of the derived genes is performing the derived 
function, and which is performing the ancestral function? According to the method of this invention, the 
derived protein is the one connected to the node where the duplication has occurred via the higher Ka/Ks 
value. This concept supports a useful tool to correlate events occurring in two trees. A duplication in 
family 1 that occurred near the same time as a duplication in family 2 is hypothesized to indicate 
that the proteins arising from the duplication from the branch having the higher Ka/Ks value in one tree 
interact when they function with the proteins arising from the duplication from the branch having the 
higher Ka/Ks value in the other tree. Conversely, and frequently quite 
usefully, when examining two contemporaneous duplication events in two separate families, the proteins 
in family 1 that do not interact with the proteins in family 2 are those that are not joined to their respective 
nodes via branches that display, during contemporaneous periods of evolution, high Ka/Ks values. 
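
The selection rule of this section can be sketched as follows. The branch labels and Ka/Ks values are hypothetical, and the function names are illustrative assumptions rather than part of any existing implementation.

```python
def derived_child(ka_ks_by_child):
    """Return the child branch joined to a duplication node via the higher Ka/Ks value."""
    if len(ka_ks_by_child) != 2:
        raise ValueError("expected exactly two child branches of a duplication")
    return max(ka_ks_by_child, key=ka_ks_by_child.get)


def predicted_interacting_pair(dup_family_1, dup_family_2):
    """For two contemporaneous duplications, pair the high-Ka/Ks branch of each family."""
    return derived_child(dup_family_1), derived_child(dup_family_2)


if __name__ == "__main__":
    # Hypothetical Ka/Ks values on the two branches leaving each duplication node.
    family_1 = {"protein_1a": 0.85, "protein_1b": 0.20}
    family_2 = {"protein_2a": 0.15, "protein_2b": 0.66}
    print(predicted_interacting_pair(family_1, family_2))
```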

As one of ordinary skill in the art will appreciate, this approach is quite general, and can be applied 
with covarion behavior, compensatory substitution, homoplasy, and even levels of high sequence 
conservation. 

VI. Tools that involve correlation between the evolutionary history of a family of proteins and 
the evolutionary history of the organism as known from some source other than genomic 
sequence data, including paleontology, geology, ecology, ontogeny, phylogeny, or systematics 
(collectively known as the "non-genomic record"). 

The methods of this invention extract information about function and function change by analyzing 
sequence data alone, and then by coupling this analysis with secondary, tertiary, and quaternary structural 
data. Those of ordinary skill in the art know, of course, of other sources of evolutionary information that 
do not come from genomic sequence data or crystal structures. These "non-genomic" data come from 
paleontology, geology, ecology, ontogeny, phylogeny, and systematics (collectively known as the 
"non-genomic record"). 

A. Correlating the topology of an evolutionary tree and the non-genomic record. 

Conversely, and quite usefully, when a node in an evolutionary tree 

Dates can be obtained approximately by protein sequence analysis. In cases where silent substitutions 
have not equilibrated, NED distances or other distances based on the analysis of silent codon substitutions 
can be used. 

As discussed above, detailed analyses of evolutionary histories frequently can provide a solution to the 
most general problem of the conventional evolutionary paradigm, the difficulty in routinely identifying a 
homolog of a target sequence with known function within the database. By analysis of non-Markovian 
evolutionary behavior at the level of the protein, a model of secondary structure can be predicted. This 
prediction can be used in turn to detect long distance homologs in some cases and exclude the possibility 
of distant homology in others. This increases the likelihood that a homolog will be found with a known 
structure, behavior, or function for a new protein sequence. If one is found, then the logic associated with 
the conventional evolutionary paradigm can be applied to generate a hypothesis concerning the behavior or 
function of the protein. 

The value of this post-genomic tool to assign behavior and structure to a target sequence problem is 
expected to grow over the near term, as the ratio of sequences supported by experimental studies to those 
not supported increases with the conclusion of genome projects, and as more sequences increase the detail 
of the evolutionary histories that can be extracted from the database directly, and therefore the quality of 
the predicted secondary structural model. 

At the next level, analysis of non-Markovian behavior at the level of the gene can alert the biological 
chemist that the logic associated with the conventional evolutionary paradigm might not apply in individual 
cases. In particular, if an episode of rapid sequence evolution intervenes in the evolutionary tree between 
the sequence of interest and the sequence with the known behavior and function, the biological chemist is 
alerted to the possibility that the function of the protein might have changed. This alert is useful even with 
close homologs, as illustrated in the example with leptin. 

But what if the evolutionary tree contains no protein with a sequence with assigned function, even one 
with low sequence similarity? Even with more limited evolutionary histories, post-genomic tools that 
analyze non-Markovian evolution at the level of the codon can be useful. By identifying the organisms 
that provide the sequences at the "leaves" of the evolutionary tree, it is frequently possible to correlate 
branches in the evolutionary tree with episodes in geological history, as determined from the fossil record. 
Especially in multicellular animals (metazoa), the fossil record can provide approximate dates for the 
emergence of new physiological function. In this case, it is possible to ask whether an episode of rapid 
sequence evolution in a protein family (in particular, an episode with a high expressed/silent ratio) 
occurred at the same time as a new physiological function emerged on earth. If so, a first level of 
hypothesis about physiological function can be proposed, even if no behavior or function of any kind is 
known for any of the modern proteins. 

Perhaps the most transparent analysis of this type concerns proteins that underwent massive radiative 
divergences in metazoa approximately 600 million years ago. This is the time of the Cambrian explosion, 
an episode in terrestrial history that marks the massive radiative divergence of multicellular animals, 
including chordates. Protein families undergoing rapid evolution at this time (for example, those of protein 
tyrosine kinases and src homology 2 domains) are almost certainly involved in the basic processes by 
which multicellular animals develop from a single fertilized egg. 

This type of analysis might be applied in the family of ribonuclease (RNase) A (E.C. 2.7.7.16), a well 
known family of digestive proteins found in ruminants. The protein underwent rapid sequence evolution 
approximately 45 million years ago, a time when ruminant digestion emerged in mammals [T. M. 
JERMANN, J. G. OPITZ, J. STACKHOUSE, and S. A. BENNER, Reconstructing the evolutionary 
history of the artiodactyl ribonuclease superfamily. Nature 374, 57-59 (1995)]. Thus, the rapid 
molecular evolution evident in the reconstructed evolutionary history of this protein suggests that the 
protein is important for ruminant digestive function. 

B. Correlating features of patterns of evolution in specific branches in the evolutionary tree 
with the non-genomic record 

This type of analysis is obviously strengthened if one now adds information concerning Ka/Ks values, 
covarion behavior, homoplasy, and compensatory changes. 

C. Correlating evolutionary events in several protein families occurring at approximately the 
same time with the non-genomic record 

This type of analysis can obviously contribute to the determination of pathways and of interactions between 
proteins from different families. These hypotheses are useful when designing two-hybrid systems, for 
example, to detect protein-protein contacts. 

Use of non-stochastic behavior generally 

One of ordinary skill in the art will recognize from Serial No. 07/857,224 that the methods of the 
instant invention view molecular evolution in a way quite distinct from the way in which standard tools 
analyze protein sequence data. Virtually all tools for comparing the sequences of homologous proteins 
assume a model for divergent evolution that is stochastic in outcome. This model treats a protein sequence 
as a linear string of letters, one letter for each amino acid. According to the model, each letter in the string 
changes (the gene and its corresponding protein mutates) at a rate that is independent of its position. 
According to the stochastic model, future and past mutations are independent. Mutations at one position 
are independent of mutations elsewhere. 

Such a model is at best an approximation for the reality of protein evolution. In reality, proteins are not 
linear strings of letters. Rather, they are organic molecules that fold in three dimensions. In the folded 
form, some positions in a protein sequence are more easily mutable (without destroying function) than 
others. Amino acids distant in the sequence but close in the fold frequently undergo correlated mutation. 
Future mutations are frequently not independent of past mutations. Thus, real proteins divergently 
evolving under functional constraints behave differently than expected based on the stochastic model. 

The difference between the reality of divergent evolution of proteins that fold and expectation based on 
the stochastic model proves to be important, as was disclosed first in Serial No. 07/857,224. By 
comparing the patterns of substitution within a set of folded proteins undergoing divergent evolution with 
expectations for those patterns based on the stochastic model, one can extract information about the fold. 

This makes the nuclear family more than a database organizational feature. Because the nuclear family 
holds a history of the pattern of divergent evolution under functional constraints in the protein, it holds 
information about the fold of the protein. From the sequences of proteins in the nuclear family alone, one 
can decide which amino acids lie on the surface of the folded structure, which lie inside, and which lie near 
the active site. Elements of secondary structure, the helices, strands, and loops can be identified. A model 
of tertiary structure can be built as well, all from the evolutionary history embodied in the nuclear family. 

EXAMPLES 

Example 1. Functional analysis of aromatase 

Aromatase is a cytochrome P450-dependent enzyme that catalyzes a three step reaction that creates 
an estrogen from an androgen. The physiological consequences of estrogen biosynthesis in human 
biology are well known, even among laymen. Estrogen is also synthesized in primitive chordates such as 
Amphioxus (Callard et al., 1984), but not in other metazoans. Therefore, estrogen appears to have been 
invented as a hormone early in the divergent evolution of chordates, presumably by recruitment of steroids 
involved in developmental biology in more primitive metazoan ancestors. 

Aromatase belongs to the cytochrome P450 superfamily of enzymes, which has some two dozen 
family members (Nebert et al., 1991). Members of the superfamily use a common chemical mechanism 
(Akhtar et al, 1997) to assimilate carbon, detoxify organic substances, and synthesize regulatory 
molecules. In biomedicine, variants of P450 oxidases can determine whether individuals have side effects 
to a therapeutic agent (Gonzalez & Nebert, 1990), and aromatase itself plays a significant role in the 
progression of some cancers. 

Recent research has found remarkable complexity in the molecular biology of the aromatase gene 
family. Two aromatase genes are known in goldfish (Callard and Tchoudakova, 1997). In contrast, only a 
single gene is known in the horse (Boerboom et al., 1997), the rat (Hickey et al., 1990), the mouse 
(Terashima et al., 1991), the human (Harada, 1988), and the rabbit (Delarue et al, 1996). Both a functional 
gene and a pseudogene are found in oxen. The pseudogene is built from homologs of exons 2, 3, 5, 8, and 
9 interspersed with a bovine repeat element (Fürbaß & Vanselow, 1995); it is transcribed but not 
translated. In several mammalian species, a single gene yields multiple forms of the mRNA for aromatase 
in different tissues via alternative splicing mechanisms. This is the case in humans (Simpson et al., 1997) 
and rabbits (Delarue et al. 1998). 

A still different phenomenology is observed in the pig (Sus scrofa). Preliminary studies found three 
distinct mRNA molecules in different tissues with differences in their coding regions (Conley et al. 1996; 
Conley et al. 1997; Choi et al., 1996; Choi et al., 1997a; Choi et al., 1997b). It was suggested that these 
might have arisen from a single gene, possibly via RNA editing or alternative splicing (Conley et al. 
1997). 

Analogous collections of phenomenology are found throughout contemporary molecular biology 
for many molecular systems. "Why?" questions are often confounded by the complexity of the 
phenomenology. When "just so" stories are proposed, they need not be compelling, especially when they 
are supported by no evidence past the phenomena themselves. 

One approach to obtain additional evidence to address functional questions in such systems requires 
placing the molecular biological phenomena within an evolutionary context. To do this for the aromatase 
family, we began with experiments to determine whether the three mRNA isoforms (and the 
corresponding proteins) in pig arose through alternative splicing, via mRNA editing, or from distinct 
genes. PCR primers were designed from sequences located within the previously characterized exon 4 of 
the porcine aromatase type III gene (Choi et al., 1996, 1997a), a region that the cDNA studies suggested 
might have internal sequence differences (Choi et al., 1997a; Conley et al., 1997), and used to amplify pig 
genomic DNA. Initially, eight clones of the PCR products were sequenced. Four of these had the 
sequence corresponding to aromatase isoform I (ovarian type) as identified from cDNA, while four others 
had the sequence corresponding to aromatase isoform III (embryo type) as identified from cDNA. 

With evidence that at least two aromatase genes could be found in pig genomic DNA, a restriction 
enzyme-based assay was designed to search genomic DNA in greater detail. Nsi I digests exon 4 from 
isoform I twice, and isoform III once. Bsm I digests exon 4 from isoform I once, but not exon 4 of 
isoform III. Exon 4 from isoform II (placental type) had no restriction sites for either enzyme. Restriction 
analysis of a total of 23 clones obtained from genomic DNA identified 8, 5, and 10 representatives of 
isoforms I, II, and III, respectively. No restriction digestion patterns indicative of a novel sequence were 
observed. Representative clones for isoforms I, II, and III were then sequenced. To further confirm the 
presence of exactly three aromatase isoforms within the porcine genome, primer pairs were designed from 
within the 5' and 3' junctions of exon 7. Sequence analysis of 10 clones derived from the PCR products 
identified six and four clones of isoforms II and III, respectively. 
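
The logic of the restriction assay can be sketched as a lookup from the observed number of Nsi I and Bsm I cuts in exon 4 to an isoform, with any other pattern flagged as a possible novel sequence. The data structure and function names are illustrative assumptions; the cut counts are those stated above.

```python
from collections import Counter

# Expected (Nsi I cuts, Bsm I cuts) within exon 4 for each porcine isoform.
EXPECTED_CUTS = {
    "I": (2, 1),
    "II": (0, 0),
    "III": (1, 0),
}


def classify_clone(nsi_cuts, bsm_cuts):
    """Return the isoform matching a digestion pattern, or None for a novel pattern."""
    for isoform, pattern in EXPECTED_CUTS.items():
        if pattern == (nsi_cuts, bsm_cuts):
            return isoform
    return None


def tally(patterns):
    """Count clones per isoform from a list of (Nsi I cuts, Bsm I cuts) observations."""
    return Counter(classify_clone(nsi, bsm) for nsi, bsm in patterns)


if __name__ == "__main__":
    # Patterns consistent with the 23 clones reported above (8, 5, and 10).
    observed = [(2, 1)] * 8 + [(0, 0)] * 5 + [(1, 0)] * 10
    print(tally(observed))
```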

With compelling evidence that the three variants of mRNA identified in cDNA studies arose from 
three paralogous genes (as opposed to editing or alternative splicing), we sought to place the paralogous 
genes within their historical context. Following standard tools to analyze protein sequences, pairwise 
alignments were constructed for the 136 pairs of proteins. An evolutionary distance (in PAM units) was 
calculated (with a variance) for each pair (Table 1). From this, an evolutionary tree was built for the 
mammalian sequences (Drawing 4) within the Darwin programming environment, with branch lengths 
calculated to minimize a least squares distance. The tree was 
adjusted to make the human and equine branchings consistent with paleontological records to obtain a 
"best consensus" tree. The sequences of the ancestral genes and proteins at branch points in the tree were 
then reconstructed. From there, mutations (including fractional mutations) at both the DNA level and 
protein level were assigned to individual branches in the tree using the method of Fitch (1971). 

Based on the tree and the reconstructed evolutionary intermediates, Ka/Ks values were assigned to 
individual branches using the method of Li et al. (1985). These reflect the normalized ratio of 
substitutions at the level of the gene that change the encoded polypeptide sequence (non-synonymous 
substitutions) to substitutions at the level of the gene that do not change the encoded polypeptide sequence 
(synonymous substitutions). Lower Ka/Ks values generally reflect conservative episodes of evolution 
where function remains constant, while higher values frequently characterize episodes of evolution where 
function is changing (Trabesinger-Ruef et al., 1996; Messier & Stewart, 1997). 

The average branch in the aromatase evolutionary tree has a value of Ka/Ks of 0.348. Inspection of 
the tree shows that the highest Ka/Ks values anywhere in the mammalian aromatase family (0.85 and 0.66) 
are found within the divergent evolution of the pig aromatases. These suggest that adaptive changes 
occurred during the triplication of the aromatase gene in pigs. Adaptive changes are well known to 
confuse simple models of molecular history built from standard sequence alignment and tree construction 
tools. Adaptive substitutions do not conform to stochastic rules modelling divergent evolution (Benner et 
al, 1997), do not accumulate in a clock-like fashion, and may arise through convergent and parallel 
evolution (Stewart et al., 1987). 

Therefore, the evolutionary history of the aromatase family was re-analyzed using pairwise Neutral 
Evolutionary Distances (NEDs) (Liberles et al., 1999), obtained for the 136 pairs of aligned aromatase 
genes (Table 2). To estimate NEDs between the aromatase gene pairs, the number (n) of "2-fold 
redundant amino acids" (Cys, Asp, Glu, Phe, His, Lys, Asn, Gln, and Tyr) that are conserved in the aligned 
pairs was determined. The number of those amino acids that are encoded by the same codon (c) was then 
determined, and the fraction (f2 = c/n) of the codons that are the same was then tabulated (Table 2). 
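
The counting of n, c, and f2 can be sketched for a pair of aligned coding sequences. The sketch assumes gap-free, in-frame sequences of equal length using the standard code; the function name f2_fraction and the short example sequences are hypothetical.

```python
# Codons of the nine 2-fold redundant amino acids (standard nuclear code).
TWO_FOLD_CODONS = {
    "C": {"TGT", "TGC"}, "D": {"GAT", "GAC"}, "E": {"GAA", "GAG"},
    "F": {"TTT", "TTC"}, "H": {"CAT", "CAC"}, "K": {"AAA", "AAG"},
    "N": {"AAT", "AAC"}, "Q": {"CAA", "CAG"}, "Y": {"TAT", "TAC"},
}
CODON_TO_AA = {
    codon: aa for aa, codons in TWO_FOLD_CODONS.items() for codon in codons
}


def f2_fraction(seq_x, seq_y):
    """Return (n, c, f2) for two aligned, in-frame, gap-free coding sequences.

    n: positions where both sequences encode the same 2-fold redundant amino acid.
    c: those positions where the two codons are also identical.
    """
    if len(seq_x) != len(seq_y) or len(seq_x) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 in length")
    n = c = 0
    for i in range(0, len(seq_x), 3):
        codon_x, codon_y = seq_x[i:i + 3], seq_y[i:i + 3]
        aa_x = CODON_TO_AA.get(codon_x)
        if aa_x is not None and aa_x == CODON_TO_AA.get(codon_y):
            n += 1
            if codon_x == codon_y:
                c += 1
    if n == 0:
        raise ValueError("no conserved 2-fold redundant positions")
    return n, c, c / n


if __name__ == "__main__":
    # Hypothetical fragments encoding N-H-Y-K-V in both sequences.
    print(f2_fraction("AATCATTACAAAGTT", "AACCATTATAAAGTT"))
```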

A variety of empirical studies show that the fixation of silent substitutions in conserved 2-fold 
redundant codon systems follows a rate law that is a simple exponential "approach to equilibrium", 
$f_2 = 0.5\,e^{-kt} + 0.5$, where $k$ is a single pseudo first order rate constant for transitions, and $t$ is the time 
(Jukes & Cantor, 1969). The NED distance is defined by $\mathrm{NED}_{x,y} = k t_{x,y} = -\ln\left[(f_{2;x,y} - 0.5)/0.5\right]$. 
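
The relation between f2 and the NED can be sketched numerically. The sketch assumes the exponential approach to an equilibrium value of 0.5 described above, inverted to give kt = -ln[(f2 - 0.5)/0.5]; the function names are illustrative.

```python
import math


def f2_from_kt(kt):
    """Expected fraction of identical codons at conserved 2-fold sites after distance kt."""
    return 0.5 * math.exp(-kt) + 0.5


def ned_from_f2(f2):
    """Neutral Evolutionary Distance (kt) from an observed f2 value.

    f2 must lie above the equilibrium value of 0.5; at or below 0.5 the
    silent sites have equilibrated and no distance can be estimated.
    """
    if not 0.5 < f2 <= 1.0:
        raise ValueError("f2 must be in the interval (0.5, 1.0]")
    return -math.log((f2 - 0.5) / 0.5)


if __name__ == "__main__":
    for kt in (0.0, 0.154, 0.199, 0.5):
        f2 = f2_from_kt(kt)
        print(f"kt = {kt:.3f}  f2 = {f2:.4f}  recovered NED = {ned_from_f2(f2):.3f}")
```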

The NED is a measure of evolutionary distance, not of evolutionary time. As distances, NEDs are 
additive, should obey the triangle inequality, and display other features that permit them to be used to build 
evolutionary trees, provided that k is constant over the period of evolutionary history being examined. A 
variety of empirical studies shows this to be approximately the case for many protein families. The 
approximation appears to be quite good for aromatase as well. Thus, if a fixed single lineage first order 
rate constant of 3 x 10^-9 changes per base per year is assumed, the NED values indicate that fish and land 
vertebrates diverged 340 million years ago (mya), birds and mammals diverged 250 mya, primates and 
ungulates diverged 73 mya, horse and artiodactyls diverged 71 mya, and pigs and ruminants diverged 62 
mya. Each of these dates is close to the date suggested by the paleontological record (Carroll, 1988). 

The NED-based dating was used to assess two alternative models to explain the triplication of the 
aromatase gene family in pigs. The first, advanced by Callard and Tchoudakova (1997), holds that the 
physiological specialization of aromatases through the formation of paralogs occurred early in vertebrate 
divergence, perhaps 400 mya, before fish and mammals diverged. If this were the case, then a functional 
explanation for the aromatase genes must be sought in fundamental features of vertebrate developmental 
biology, those that emerged early in vertebrate evolution. Conversely, the triplication of aromatase may 
have occurred in response to the domestication of pigs. In this case, a functional explanation for the aromatase 
genes would be found in the selective pressures applied by breeding programs. 

The NEDs separating the three pig isoforms range from 0.154 (corresponding to a distance of 51 
million years between the proteins) to 0.199 (corresponding to a distance of 66 million years). 
Recognizing that the total distance between two proteins is twice the distance along a single lineage 
from the point of divergence to the modern protein (half of the distance occurs along one lineage after 
divergence, and half of the distance occurs along the other lineage), the NEDs suggest that the first 
duplication that led to the three porcine aromatase genes occurred ca. 33 mya, and the second occurred ca. 25 
mya. An evolutionary tree constructed from these NEDs is consistent with these conclusions, showing 
that the porcine aromatases branched after the lineage leading to pig diverged from the lineage leading to 
ox (Drawing 5). This tree shows a different branching order for the three porcine paralogs than the tree 
based on amino acid sequences, something not uncommon in the presence of substantial adaptive 
evolution. Nevertheless, the data are consistent with an evolutionary model that holds that the ancestor of 
pig and oxen (approximated in the fossil record most closely by the now extinct Diacodexis, which lived 
perhaps 55 mya) contained a single aromatase gene, and that the paralogous genes in pig arose ca. 25 
million years later. Thus, the paralogs in pig can be explained neither in terms of the fundamentals of 
vertebrate development, nor as a consequence of swine domestication. 
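
The arithmetic behind these dates can be sketched as follows, assuming the single lineage rate constant of 3 x 10^-9 changes per base per year and that the pairwise NED is shared equally between the two lineages. The function names are illustrative.

```python
RATE_PER_LINEAGE = 3e-9  # changes per base per year, as assumed in the text


def pairwise_separation_my(ned, k=RATE_PER_LINEAGE):
    """Total evolutionary separation between two sequences, in millions of years."""
    return ned / k / 1e6


def divergence_date_mya(ned, k=RATE_PER_LINEAGE):
    """Date of the common ancestor in mya; half the separation lies on each lineage."""
    return pairwise_separation_my(ned, k) / 2.0


if __name__ == "__main__":
    for ned in (0.154, 0.199):
        print(
            f"NED {ned}: separation {pairwise_separation_my(ned):.1f} My, "
            f"divergence ca. {divergence_date_mya(ned):.1f} mya"
        )
```

With these assumptions the two NEDs give separations of about 51 and 66 million years and divergence dates of roughly 26 and 33 mya, in line with the approximate dates quoted above.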

Error in these dates can arise from two sources, standard error (which arises from fluctuation) and 
systematic error (which arises from the fact that the evolutionary model does not represent actual 
evolution). The first can be calculated by standard statistical approaches using standard statistical 
assumptions. The second cannot be calculated, as too little is known about possible systematic errors in 
the evolutionary model. The f2 distances are each based on ca. 120 two-fold redundant codon systems, 
and variances for the NEDs are given in Table 2. Inspection of the tree in Drawing 5 gives an indication of 
the actual error, as the NED between any ancestral sequence and all modern sequences derived from it 
should be the same. The calculated distance from the divergence of the three porcine enzymes to the type 
II enzyme is 31 million years, to isoform I is 32 million years, and to isoform III is 30 million years. 
Thus, the average reported (31 mya) could be as low as 30 and as high as 32 mya. All of these dates are in 
the Oligocene, after the first episode of cooling. The divergence of isoforms I and III ranges from 24-26 
mya. These apparent errors are less than the errors associated with the dating (from the fossil record) used 
to set the molecular clock. 

Instead, an understanding of why pigs have three genes for aromatase must lie in the environment of 
(and events that occurred during) a time on Earth 25-33 mya. For this we turn to the paleontological, 
paleogeographical, and paleoclimatological records of that period, which is near the boundary between the 
Oligocene (38-25 mya) and the Miocene (25-5 mya), two epochs in the Cenozoic "Age of Mammals" 
(Prothero, 1994). This period is an unusual one in the history of the Earth. When characterized globally, 
the Earth during the Eocene (54-38 mya) was warm and tropical, evidently free of ice over the entire 
planet. By the end of the Eocene, however, the Earth had begun to suffer a dramatic cooling that was to 
lower the mean annual temperature by as much as 15 °C (Wolfe, 1978). Areas of the planet became 
covered with ice, and the impact of the cooling on the biosphere was dramatic. For example, perhaps 80% 
of the North American faunal genera became extinct (Prothero, pp. 113-114; Stucky, 1990). By the end of 
the Oligocene and into the Miocene 25 mya, however, the global cooling abated, the climate turned 
warmer, and the biosphere became more tropical. 

Did this climate change occur in the environment where the ancestors of modern pigs were living 
just before the Oligocene-Miocene boundary? At this time, the North American and Eurasian fauna were 
geographically isolated. Modern peccaries (Tayassuidae), not pigs, emerged in the New World from 
ancestral suids that immigrated from Asia. North America cannot be the site for the triplication of the 
aromatase genes in pig, therefore, and its climate 25-33 mya is irrelevant to an explanation for the 
triplication of the aromatase genes in pigs. 

Instead, modern pigs most likely emerged in Europe near the end of the Oligocene (Cooke & 
Wilkinson, 1978, but see also Pilgrim, 1941) from more primitive entelodonts such as Archaeotherium. 
During the Oligocene, the Dichobunids (the most probable ancestral stock) were most abundant in 
Europe. Likewise, the first true pig, Propalaeochoerus, from the late Oligocene, was common only in 
Europe (Cooke and Wilkinson, 1978; Carroll, 1988). This makes the paleoenvironment of Europe near 
the Oligocene-Miocene boundary relevant to the functional implications of the aromatase gene triplication 
in pigs. 

Various paleobiological evidence suggests that the climate in Europe also deteriorated in the 
Oligocene and warmed in the Miocene. A study of amphibian distribution in the Oligocene of Europe, for 
example, is consistent with a significant drop of mean annual temperatures in the European Oligocene. In 
the Miocene, amphibian populations rebounded, corresponding to an improvement in the climate (Rocek, 
1996). Likewise, analysis of the deer population suggested a subtropical climate returning to Europe in the 
early Miocene (Anzanza, 1993). The Iberian peninsula in the early Miocene had an intertropical to 
subtropical climate (Murelaga et al., 1999). Crocodiles also returned to Europe at the Oligocene-Miocene 
boundary (Antunes & Cahuzac, 1999). The presence of arboreal primates in the European Miocene also 
suggests a forested environment (Qi & Beard 1998). Each of these facts (and many others) suggests that 
the second duplication of the aromatase gene in pigs occurred at the same time as the return of subtropical 
and warm temperate forests and woodlands to Europe, the type of environment for which suids are best 
adapted (Fortelius et al., 1996). 

Immediately thereafter, the suids underwent a significant radiative divergence, and came to occupy all 
of the Old World. By the early Miocene, the two basal members that were to lead to all modern pigs, 
Hyotherium and Xenochoerus, were widespread in Europe, Asia, and Africa. The amelioration of the 
climate evidently assisted in this spread. For example, the pigs now in Africa apparently came from 
southwest Asia in the Early Miocene. A fossil of this date of a tetraconodontine pig has been reported 
from the Levant (van der Made & Tuna, 1999), through which the pigs would have migrated to get from 
Eurasia to Africa, and which was a tropical environment at the beginning of the Miocene (Tchernov, 
1992). In the middle and late Miocene, modern suids had diversified in Europe in further response to the 
change in the paleoclimate (Fortelius et al., 1996). 

Why might a change in climate with a return of forested (and perhaps tropical) ecosystems have led 
to a selection of pigs that had three different aromatase genes? We turned to porcine reproductive 
physiology for insight. We recently found that the type III aromatase was expressed by the embryo 
between day 11 and day 13 following fertilization, during the late pre-implantation period (Choi et al., 
1997a,b). The estrogen generated by the type III isoform causes uterine undulation. This undulation, in 
turn, is expected to cause the spacing of the ca. 30 eggs that are fertilized in a typical conception, which 
eventually yield the 8-12 piglets that are normally birthed. In pigs, if the litter does not contain at least 5 
individuals, the entire conception is aborted. Thus, the embryonic form of aromatase may have a role in 
spacing the embryos uniformly around the uterus, and preventing abortion. These are useful adaptations if 
one wants to have an increased litter size. 

Evidence in the paleontological record suggests that the size of the litter in pigs increased
dramatically 25-30 mya, at the same time as isoform III of aromatase was generated by triplication, the
local paleoclimate warmed, and the pigs began a major radiative divergence. The ancestral suid
Archaeotherium, disappearing from the fossil record at the end of the Oligocene, may have given birth to a
single pup. All of the contemporary forms of pigs arising from the divergence of Hyotherium and
Xenochoerus, known from the Early Miocene, have large litter sizes. Further, Archaeomeryx, the early
Eocene artiodactyl that is presumed to be the ancestral ruminant, resembles the contemporary chevrotain,
which also births a single pup.

The biogeography of the suids was again consulted to test the hypothesis that litter size increased in
the suids near the time that the climate changed and the aromatase gene triplicated. As noted above,
peccaries were isolated in the New World in the Early Oligocene, before the NED-derived date for the
triplication of the aromatase gene in the Old World pigs. Consistent with the model, the peccary has only
one offspring. The model predicts as well that the peccary should have only a single aromatase gene.

Pig 

Type I C AAT CAT TAG ACG TGC CGA TTI C3GC AGO AAA CTT GGG TTC SAA 
NHYTCRFGSKLGLE 
III T AGT CAC TAC ACA TCC CGA TTI GGC AGO AAA OCT GGG TTG GAG 
SHYTSRFGSKPGLQ 
II C AGT CAC TAC ACA TCC CGA TTC GGC AGC AAA CCT GGG TTC SAG
SHYTSRFGSKPGLE
Peccary Ci«3TCACTACACATCCa3ATTCGCX:AGCAAA(XTGGGTroCAG 
SHYTSRFGSKPGLQ 

Pig 

Type I TGC ATT GGC ATG CAT GAA AAA GGC ATC ATG TTT AAC AAT AA 
CIGMHEKGIMFNNN 
III TTC ATT GGC ATG CAT GAG AAA GGC ATT ATA TTC AAC AAT AA 
FIGMHEKGIIFNNN 
II TGC ATC GGC ATG TAT GAG AAG GGC ATC ATA TTT AAT AAT GA 

CIGMYEKGIIF NND 
Peccary TTC ATT GGA ATG CAT GAG AAA GGC ATC ATA TTT AAC AAC AA 

FIGMHEKGIIFNNN 



To test this prediction, peccary seminal plasma (from the Center for Reproduction of Endangered
Species, Zoological Society of San Diego) was subjected to PCR amplification using exon 4-specific
primers as described above. Bands having the expected sizes were observed by agarose gel
electrophoresis. Five clones derived from the PCR products were found to have identical sequences, all
different from the sequences of the pig aromatase. The NED comparison (using a rate constant of
3 x 10^-9 changes per base per year) suggested that the peccary diverged 40 mya from the pig,
corresponding to the fossil record and the known isolation of the New and Old World paleoecosystems.

The molecular biological, fossil, paleoecological, and physiological evidence are all consistent with a
model that proposes that climate changes in Europe at the end of the Oligocene selected for pigs that had
larger litter sizes. The successful lineage generated a new embryo aromatase by gene duplication, and
expressed it at the time of implantation, forming the molecular basis of the physiology that enabled large
litter sizes. It is possible to speculate on why a conversion from an open, savannah-like environment to a
forested environment might enable larger litter sizes. Contemporary savannah babies are large and born
with the ability to run, presumably because hiding is no alternative. In contrast, in a forested environment,
pups are easier to hide, permitting them to be smaller and less precocious at birth, permitting in turn a
larger number of pups for the same total birth weight. Indeed, the contemporary Sus scrofa sow hides her
piglets in earthen hollows covered with leaves (Eisenberg, 1981).

Implantation is one of the least well understood steps in mammalian reproductive biology, including
human reproductive biology. Implantation is, of course, found only in mammal reproductive physiology,
and is itself therefore a relatively recent innovation in physiology, emerging perhaps 200 million years
ago. This analysis emphasizes the degree of innovation and experimentation that is continuing in
mammalian reproductive physiology. Further, the analysis is a combination of computational informatics,
geology, paleontology, physiology, molecular biology and chemistry. Analogous analyses should be
applicable in functional genomics throughout the biological, biomedical and biochemical sciences,
especially as genome projects are completed and as new tools become available to analyze genomic
databases.

Example 2. Covarion behavior

Functional changes leave signatures in the patterns of sequence evolution in a protein family. 
Covarion behavior was detected in alcohol dehydrogenase [Ben89] and superoxide dismutase 
[Miy95]. As a preliminary study in the past year, we examined elongation factors (EF). These are 
proteins that have diverged far more slowly; indeed, they are archetypal examples of a protein that
performs the "same" function in all three kingdoms of life.

In the study, thirty EF-Tu/EF-1α protein sequences were aligned over 380 sites using the
alignment program DARWIN. Replacement rates per site for bacterial and eukaryotic EFs were
estimated using a gamma-based, maximum likelihood (ML) model for protein sequences (JTT + Γ)
and the phylogeny of Baldauf et al. [Bal96] for EF-Tu and EF-1α. An α of 0.78 was calculated for
the entire tree, with a standard deviation (SD) of 0.05 using parametric bootstrapping (evolutionary
simulations) [Swo96]. Interestingly, the α values for the bacterial and eukaryotic subtrees were
significantly different from that for the entire tree [0.46 (0.04) and 0.38 (0.04), respectively]. These
reductions in α for bacteria and eukaryotes alone are expected of a non-stationary covarion process.

The distribution of rate differences per site between bacterial and eukaryotic EFs is leptokurtic;
i.e., sites are over-represented at the mean and in the tails, and under-represented in the "shoulders,"
relative to the expectations of a normal distribution. Thirty-seven percent of the sites have essentially
the same rate in the two groups (rate difference of ~0), as expected under a stationary gamma process.
However, 18 and 21 sites evolve >2 SD faster in bacteria than eukaryotes, and vice versa, respectively.
These 10% of the sites are most responsible for the covarion characteristics of EF-Tu and EF-1α.
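
The site-flagging step described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: the function names are hypothetical, the per-site rates are assumed to be already estimated for each group, and the SD is taken here as the spread of the per-site rate differences themselves (the text does not state which SD was used).

```python
from statistics import mean, pstdev


def excess_kurtosis(values):
    """Population excess kurtosis; a value > 0 indicates a leptokurtic distribution."""
    m = mean(values)
    sd = pstdev(values)
    fourth_moment = mean([(v - m) ** 4 for v in values])
    return fourth_moment / sd ** 4 - 3.0


def flag_covarion_sites(rates_a, rates_b, n_sd=2.0):
    """Return (faster_in_a, faster_in_b) as lists of 0-based site indices.

    A site is flagged when its rate difference (a - b) lies more than n_sd
    standard deviations away from zero, zero being the difference expected
    under a stationary process. Using the SD of the differences is an
    assumption of this sketch.
    """
    diffs = [a - b for a, b in zip(rates_a, rates_b)]
    sd = pstdev(diffs)
    faster_in_a = [i for i, d in enumerate(diffs) if d > n_sd * sd]
    faster_in_b = [i for i, d in enumerate(diffs) if d < -n_sd * sd]
    return faster_in_a, faster_in_b
```

With real data, `rates_a` and `rates_b` would hold the per-site replacement rates for the bacterial and eukaryotic subtrees over the same aligned sites.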

Residues displaying abnormal evolutionary behavior were then mapped to a three dimensional
model of the protein based on a crystal structure of EF-Tu. These were used to generate structural
hypotheses for the known behavioral differences. For example, bacterial EF-Tu
binds GDP approximately 100-fold tighter than GTP. Eukaryotic EF-1α, in contrast, binds both with similar
affinities. EF-Tu regenerates its active form by binding to the single-subunit nucleotide exchange
factor EF-Ts. EF-1α requires the multi-subunit nucleotide exchange factor EF-1βγδ. EF-1α also
interacts with the cytoskeleton and may thereby play a role in cellular transformation and apoptosis.
EF-Tu can have no such role in bacteria. Residues were identified that, at the level of hypothesis, are
responsible for each of these behavioral differences.

Covarion behavior indicates changing function. It is therefore expected to correlate positively with
events with high Ka/Ks ratios. We will see if that is correct. Because Ka/Ks ratios use a silent
substitution clock that ticks rapidly, while covarion analysis does not, the two are somewhat
complementary. This addresses another of the concerns of the referees, who objected that Ka/Ks
ratios were not applicable far enough back in time for their tastes (only ca. 500 my).

Example 3. Identifying mutations and in vitro properties of seminal ribonuclease that contribute to 
selected function. 

Bovine seminal ribonuclease (RNase) diverged from bovine pancreatic RNase approximately 35
million years ago. Seminal RNase represents approximately 2% of the total protein in bovine seminal
plasma. It displays antispermatogenic activity [Dostal, J., Matousek, J. (1973) Isolation and some
chemical properties of aspermatogenic substance from bull seminal vesicle fluid. J. Reprod. Fertil. 33,
263-274], immunosuppressive activity [Soucek, J., Matousek, J. (1981) Inhibitory effect of bovine seminal
ribonuclease on activated lymphocytes and lymphoblastoid cell lines in vitro. Folia Biol. Praha 27,
334-345. Soucek, J., Hruba, A., Paluska, E., Chudomel, V., Dostal, J., Matousek, J. (1983) Immunosuppressive
effects of bovine seminal fluid fractions with ribonuclease activity. Folia biologica (Praha) 29, 250-261.
Soucek, J., Chudomel, V., Potmesilova, I., Novak, J. T. (1986) Effect of ribonucleases on cell-mediated
lympholysis reaction and on GM-CFC colonies in bone marrow culture. Nat. Immun. Cell Growth
Regul. 5, 250-258], and cytostatic activity against many transformed cell lines [Matousek, J. (1973) The
effect of bovine seminal ribonuclease on cells of Crocker tumor in mice. Experientia 29, 858. Vescia, S.,
Tramontano, D., Augusti-Tocco, G., D'Alessio, G. (1980) In vitro studies on selective inhibition of tumor
cell growth by seminal ribonuclease. Cancer Res. 40, 3740]. Each of these biological activities is
essentially absent from pancreatic RNase. Further, seminal RNase binds to anionic glycolipids, binds and
melts duplex DNA, hydrolyzes duplex RNA, has a dimeric quaternary structure, and binds to
spermatozoa.

Each of these behaviors is measured in vitro and is well known in the art. In the absence of the method 
of the instant invention, the behaviors are difficult to interpret. Some, any, or all of the behaviors might 
serve an adaptive role. It is possible that none of these behaviors serve adaptive roles. Indeed, it is 
conceivable that the protein has no adaptive role at all. This makes it difficult to make even the simplest 
research decisions, as the only in vitro properties of a protein that are interesting to study are those that 
have a physiological function. 

To resolve these issues, genes for seminal and pancreatic RNases were obtained from a variety of 
organisms closely related to Bos taurus, using cloning procedures well known in the art. These were then 
sequenced, and a maximum parsimony tree was constructed using MacClade. From this tree were 
calculated the sequences of RNases that were intermediates in the evolution of the seminal RNase, using 
the maximum parsimony method well known in the art. 

Next, the ratio of expressed to silent substitutions was calculated along each branch of the 
evolutionary tree. A very high ratio of expressed to silent substitutions was observed in the evolutionary 
period following the divergence of kudu [Trabesinger-Ruf, N., Jermann, T. M., Zankel, T. R., Durrant, B.,
Frank, G., Benner, S. A. Pseudogenes in ribonuclease evolution. A source of new biomacromolecular
function? FEBS Lett. 382, 319-322 (1996)] from the lineage leading to ox, until the divergence of water
buffalo and ox. This is indicative of an episode of adaptive evolution, where the protein acquires a new 
physiological function. Further work indicated that the seminal RNase gene was not expressed in the 
period of evolution since the divergence of the seminal RNase family and the divergence of kudu. 

Last, protein engineering methods were used to prepare the seminal RNase that existed at the beginning
of the episode of rapid sequence evolution. Its properties were then examined experimentally. It was
discovered that the ability of the protein to bind to anionic glycolipids was roughly the same before and
after this episode of rapid evolution. So too was its sensitivity to inhibition by placental RNase inhibitor.
Thus, neither of these properties is likely to be under selective pressure.

In contrast, the immunosuppressivity of the ancestral RNase (IC50 ca. 8 micrograms/mL) was greater 
than that of pancreatic RNase (IC50 ca. 100 micrograms/mL). But following the period of rapid sequence 
evolution characteristic of a protein evolving to serve a new physiological function, the 
immunosuppressivity became still greater (IC50 ca. 2 micrograms/mL). Thus, one concludes that 
immunosuppressivity as measured in vitro is a selected trait of the protein, or is closely structurally 
coupled to a trait that is selected.

Likewise, the ability of the seminal RNase protein to bind and melt duplex DNA, and to hydrolyze
duplex RNA, also underwent rapid increase between the time of divergence of kudu and modern ox.
Thus, it too is either a selected trait of the protein, or is closely structurally coupled to a trait that is
selected.

In vitro experiments in biological chemistry extract data on proteins and nucleic acids (for example) 
that are removed from their native environment, often in pure or purified states. While isolation and 
purification of molecules and molecular aggregates from biological systems is an essential part of 
contemporary biological research, the fact that the data are obtained in a non-native environment raises 
questions concerning their physiological relevance. Properties of biological systems determined in vitro 
need not correspond to those in vivo, and properties determined in vitro need have no biological relevance 
in vivo. 

To date, there has been no simple way to say whether or not biological behaviors are important
physiologically to a host organism. Even in those cases where a relatively strong case can be made for
physiological relevance (for example, for enzymes that catalyze steps in primary metabolism), it has
proven to be difficult to decide whether individual properties of those enzymes (kcat, Km, kinetic order,
stereospecificity, etc.) have physiological relevance. Especially difficult, however, is to ascertain which
behaviors measured in vitro play roles in "higher" function in metazoa, including digestion, development,
regulation, reproduction, and complex behavior.

Analysis of non-Markovian behavior, as described above, permits the biological chemist to identify 
episodes in the history of a protein family where new function is emerging. This suggests a general 
method to determine whether a behavior measured in vitro is important to the evolution of new 
physiological function. We may take the following steps: 

(a) Prepare in the laboratory proteins that have the reconstructed sequences corresponding to the
ancestral proteins before, during, and after the evolution of new biological function (34), as revealed by an
episode of high expressed to silent ratio of substitution in a protein. This high ratio compels the
conclusion that the protein itself serves a physiological role, one that is changing during the period of
rapid non-Markovian sequence evolution.

(b) Measure in the laboratory the behavior in question in ancestral proteins before, during, and after the
evolution of new biological function, as revealed by an episode of high expressed to silent ratio of
substitution. Those behaviors that increase during this episode are deduced to be important for
physiological function. Those that do not are not.
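
The decision rule in step (b) can be written down as a short sketch. The function name, the labels, and the two-fold threshold `min_fold` are assumptions introduced only for illustration; the text gives no numerical cutoff for what counts as an increase. Immunosuppressivity is expressed as potency (1/IC50) using the IC50 values reported for seminal RNase in this example (ca. 8 micrograms/mL before and ca. 2 micrograms/mL after the episode).

```python
def classify_behaviors(measurements, min_fold=2.0):
    """Label each behavior by whether it increased across the episode.

    measurements maps a behavior name to (before, after), where a larger
    number means a stronger behavior. min_fold is a hypothetical threshold,
    not a value taken from the text.
    """
    labels = {}
    for name, (before, after) in measurements.items():
        fold = after / before
        if fold >= min_fold:
            labels[name] = "candidate selected trait"
        else:
            labels[name] = "not implicated"
    return labels


# Potency = 1 / IC50 (micrograms/mL); glycolipid binding is "roughly the same"
# before and after, so it is entered with placeholder values of 1.0.
example = {
    "immunosuppressivity": (1 / 8.0, 1 / 2.0),
    "glycolipid binding": (1.0, 1.0),
}
labels = classify_behaviors(example)
```

A behavior labeled as a candidate may still be only structurally coupled to the selected trait, so the label is a starting hypothesis rather than a conclusion.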

As an example, this method was applied to the bovine seminal ribonuclease (RNase) family. Bovine
seminal RNase diverged from bovine pancreatic RNase approximately 35 million years ago. Seminal
RNase represents approximately 2% of the total protein in bovine seminal plasma. It displays
antispermatogenic activity [J. Dostal and J. Matousek, Isolation and some chemical properties of
aspermatogenic substance from bull seminal vesicle fluid. J. Reprod. Fertil. 33, 263-274 (1973).],
immunosuppressive activity [J. Soucek and J. Matousek, Inhibitory effect of bovine seminal ribonuclease on
activated lymphocytes and lymphoblastoid cell lines in vitro. Folia Biol. Praha 27, 334-345 (1981); J.
Soucek, A. Hruba, E. Paluska, V. Chudomel, J. Dostal and J. Matousek, Immunosuppressive effects of
bovine seminal fluid fractions with ribonuclease activity. Folia biologica (Praha) 29, 250-261 (1983); J.
Soucek, V. Chudomel, I. Potmesilova, and J. T. Novak, Effect of ribonucleases on cell-mediated
lympholysis reaction and on GM-CFC colonies in bone marrow culture. Nat. Immun. Cell Growth
Regul. 5, 250-258 (1986)], and cytostatic activity against many transformed cell lines [J. Matousek, The
effect of bovine seminal ribonuclease on cells of Crocker tumor in mice. Experientia 29, 858-859 (1973);
S. Vescia, D. Tramontano, G. Augusti-Tocco and G. D'Alessio, In vitro studies on selective inhibition of
tumor cell growth by seminal ribonuclease. Cancer Res. 40, 3740-3744 (1980)]. Each of these biological
activities is essentially absent from pancreatic RNase. Further, seminal RNase binds to anionic
glycolipids, binds and melts duplex DNA, hydrolyzes duplex RNA, has a dimeric quaternary structure,
and binds to spermatozoa.

Each of these behaviors is measured in vitro, as is the case for a wide range of biological 
phenomenology recorded in the literature. The behaviors are difficult to interpret. Some, any, or all of the
behaviors might serve an adaptive role. It is possible that none of these behaviors serve adaptive roles. 
Indeed, it is conceivable that the protein has no adaptive role at all. This makes it difficult to make even the 
simplest research decisions, as the only in vitro properties of a protein that are interesting to study are 
those that have a physiological function. 

To resolve these issues using the post-genomic method outlined above, genes for seminal and 
pancreatic RNases were obtained from a variety of organisms closely related to Bos taurus, using cloning 
procedures well known in the art. These were then sequenced, and a maximum parsimony tree was 
constructed using MacClade. From this tree were calculated the sequences of RNases that were 
intermediates in the evolution of the seminal RNase, using the maximum parsimony method and checked 
using maximum likelihood tools implemented in Darwin (23).

Next, the ratio of expressed to silent substitutions was calculated along each branch of the evolutionary 
tree. A very high ratio of expressed to silent substitutions was observed in the evolutionary period 
following the divergence of cape buffalo [N. Trabesinger-Ruf, T. M. Jermann, T. R. Zankel, B. Durrant,
G. Frank and S. A. Benner, Pseudogenes in ribonuclease evolution. A source of new biomacromolecular
function? FEBS Lett. 382, 319-322 (1996).] from the lineage leading to ox, until the divergence of water
buffalo and ox. This is indicative of an episode of adaptive evolution, where the protein acquires a new
physiological function. Further work indicated that the seminal RNase gene was not expressed in the
period of evolution since the divergence of the seminal RNase family and the divergence of cape buffalo. 

Last, protein engineering methods were used to prepare the seminal RNase that existed at the 
beginning of the episode of rapid sequence evolution. Its properties were then examined experimentally. It 
was discovered that the ability of the protein to bind to anionic glycolipids was roughly the same before
and after this episode of rapid evolution. So too was its sensitivity to inhibition by placental RNase
inhibitor. Thus, neither of these properties is likely to be under selective pressure.

In contrast, the immunosuppressivity of the ancestral RNase (IC50 ca. 8 micrograms/mL) was greater 
than that of pancreatic RNase (IC50 ca. 100 micrograms/mL) (J. Sleasman, M. Rojas, personal
communication). But following the period of rapid sequence evolution characteristic of a protein evolving 
to serve a new physiological function, the immunosuppressivity became still greater (IC50 ca. 2 
micrograms/mL). Thus, one concludes that immunosuppressivity as measured in vitro is a selected trait of 
the protein, or is closely structurally coupled to a trait that is selected. 

Likewise, the ability of the seminal RNase protein to bind and melt duplex DNA, and to hydrolyze
duplex RNA, also underwent rapid increases between the time of divergence of cape buffalo and modern
ox. Thus, it too is either a selected trait of the protein, or is closely structurally coupled to a trait that is 
selected. In contrast, dimeric structure did not emerge during this period. Dimeric structure, therefore, is 
presumably not as important to the new selected function of the protein, although it may be a trait that was 
initially useful in the selection of the system for further optimization during the period of rapid evolution. 

Example 4. Assignment of episodes of adaptive evolution in the protein leptin, and placing these in 
predicted secondary structural elements 

From the GenBank database, DNA and protein sequences were retrieved for the genes encoding
leptins and the corresponding proteins, also known as the obesity gene product. A multiple alignment
was constructed for the DNA sequences and for the protein sequences. These were
converted to a file suitable for MacClade to use. For the DNA sequences, a tree was built using
MacClade based on the known relationship between the organisms from which these sequences
were derived; this proved to be the most parsimonious tree as well. MacClade was also used to build a tree
for the protein sequences based on the known relationship between organisms; this proved not to be the
most parsimonious tree (by 1 change). The DNA tree was taken to be definitive because of its consistency
with the biological (cladistic) data showing that the primates form a clade.

A secondary structure prediction was made for the protein family using the tools disclosed in Serial
No. 07/857,224. The evolutionary divergence of the sequences available for the leptin family is small, only
21 PAM units (point accepted mutations per 100 amino acids), and predictions were biased to favor surface
assignments [Benner, S. A., Badcoe, I., Cohen, M. A., Gerloff, D. L. Bona fide prediction of aspects of
protein conformation. Assigning interior and surface residues from patterns of variation and conservation
in homologous protein sequences. J. Mol. Biol. 235, 926-958 (1994)]. Thus, positions holding conserved
KREND were assigned as surface residues, conserved H and Q were assigned to the surface as well,
while positions holding conserved CST were assigned as uncertain. Surface and interior assignments are
summarized in Table 3.

A secondary structure was then predicted for the leptins using the methods disclosed in Serial No.
07/857,224. The multiple alignment is shown in Table 3. Five separate secondary structural elements were
identified; the results are summarized in Table 3. A disulfide bond is presumed to connect positions 96 and
146. These secondary structural elements can be accommodated by only a small number of overall folds.
Interestingly, the pattern of secondary structure in this prediction is consistent with an overall fold that
resembles that seen in cytokines such as colony stimulating factor [Hill, C. P., Osslund, T. D., Eisenberg,
D. (1993) Proc. Nat. Acad. Sci. 90, 5176-5181] and human growth hormone [de Vos, A. M., Ultsch, M.
& Kossiakoff, A. A. (1992). Science 255, 306-312].

To decide whether evolutionary function may have changed under selective pressure during the
divergent evolution of the protein family, a multiple alignment of the protein sequences and a multiple
alignment for the corresponding DNA sequences were constructed. A MacClade-generated maximum
parsimony tree was printed for each position in the protein sequence where there was a change, and for
each position in the DNA sequence where there was a change. Each mutation on each tree was examined
by hand, and the silent and expressed mutations that occurred were assigned to individual branches on the
evolutionary tree. For each branch of the tree, the numbers of silent and expressed changes were
tabulated, and the ratio of expressed to silent changes calculated. These are shown in Drawing 1. Tables 4
and 5 contain the data used in this example.
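
The per-branch tabulation described above was done by hand; the following Python sketch shows one way the same bookkeeping could be automated for a single branch. It is an illustration under stated assumptions, not the procedure used in the example: the function names are hypothetical, the two sequences are assumed to be aligned, gap-free, in-frame coding DNA for the ancestral and descendant nodes of the branch, the standard genetic code is assumed, and each changed codon is scored once (silent if the encoded amino acid is unchanged, expressed otherwise).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_changes(ancestor, descendant):
    """Return (expressed, silent) codon changes along one branch."""
    if len(ancestor) != len(descendant) or len(ancestor) % 3 != 0:
        raise ValueError("sequences must be aligned, gap-free and in frame")
    expressed = silent = 0
    for i in range(0, len(ancestor), 3):
        a = ancestor[i:i + 3].upper()
        d = descendant[i:i + 3].upper()
        if a == d:
            continue
        if CODON_TABLE[a] == CODON_TABLE[d]:
            silent += 1      # codon changed, amino acid did not
        else:
            expressed += 1   # amino acid replaced
    return expressed, silent


def expressed_to_silent_ratio(expressed, silent):
    """Ratio of expressed to silent changes; None when no silent change is seen."""
    if silent == 0:
        return None
    return expressed / silent
```

For example, `count_changes("ATGAAACTT", "ATGAGACTC")` scores AAA to AGA (Lys to Arg) as expressed and CTT to CTC (Leu to Leu) as silent. A raw count ratio of this kind does not normalize for the number of silent and expressed sites available, which matters when comparing branches of very different length.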

The branches on the evolutionary tree leading to the primate leptins from their ancestors at the time
that rodents and primates diverged had an extremely high ratio of expressed to silent changes. From this
analysis, it was concluded that the biological function of leptins has changed significantly in the primates
relative to the function of the leptin in the common ancestor of primates and rodents.

This approach can be illustrated in a biomedically interesting family of proteins by examining the
protein leptin, a protein whose mutation in mice is evidently correlated with obesity, and was previously
known as the "obesity gene protein". The protein has attracted substantial interest in the pharmaceutical
industry, especially after a human gene encoding a leptin homolog was isolated. According to the
conventional evolutionary paradigm, because it is a homolog of the mouse leptin, the human leptin must
also play a role in obesity, and might be an appropriate target for pharmaceutical companies seeking
human pharmaceuticals to combat this common condition in the first world.

DNA and protein sequences were retrieved for the genes encoding leptins. A multiple alignment was
constructed for the DNA sequences and for the protein sequences. Congruent trees for
both the DNA and protein sequences were then constructed, and sequences at the nodes of the tree
reconstructed using MacClade [W. P. Maddison, D. R. Maddison, MacClade, Analysis of Phylogeny and
Character Evolution, Sinauer Associates, Sunderland MA (1992).] and the known relationship between
the organisms from which these sequences were derived. For the DNA sequences, the biologically most
plausible tree proved to be the most parsimonious tree as well. The most parsimonious tree for the protein
sequences proved not to be the most plausible tree (by one change) from a biological perspective. The
DNA tree was taken to be definitive because of its consistency with the biological (cladistic) data.

A secondary structure prediction was made for the protein family. The evolutionary divergence of the 
sequences available for the leptin family is small - only 21 PAM units (point accepted mutations per 100 
amino acids) - and predictions were biased to favor surface assignments [S. A. Benner, I. Badcoe, M. A.
Cohen and D. L. Gerloff, Bona fide prediction of aspects of protein conformation. Assigning interior and
surface residues from patterns of variation and conservation in homologous protein sequences. J. Mol.
Biol. 235, 926-958 (1994).]. Thus, positions holding conserved KREND were assigned as surface
residues, conserved H and Q were assigned to the surface as well, while positions holding conserved CST 
were assigned as uncertain. 

Five separate secondary structural elements were identified. A disulfide bond was presumed to connect 
positions 96 and 146. These secondary structural elements can be accommodated by only a small number
of overall folds. Interestingly, the pattern of secondary structure in this prediction is consistent with an
overall fold that resembles that seen in cytokines such as colony stimulating factor [C. P. Hill, T. D.
Osslund and D. Eisenberg, The structure of granulocyte colony stimulating factor and its relationship to
other growth factors. Proc. Nat. Acad. Sci. 90, 5176-5181 (1993).] and human growth hormone [A. M.
De Vos, M. Ultsch and A. A. Kossiakoff, Human growth-hormone and extracellular domain of its
receptor. Crystal-structure of the complex. Science 255, 306-312 (1992).].

To decide whether evolutionary function may have changed under selective pressure during the
divergent evolution of the protein family, silent and expressed mutations were assigned to individual
branches on the evolutionary tree. For each branch of the tree, the numbers of silent and
expressed changes were tabulated, and the ratio of expressed to silent changes calculated. These are
shown in Drawing 2.

The branches on the evolutionary tree leading to the primate leptins from their ancestors at the time that 
rodents and primates diverged had an extremely high ratio of expressed to silent changes. From this 
analysis, it was concluded that the biological function of leptins has changed significantly in the primates 
relative to the function of the leptin in the common ancestor of primates and rodents. This conclusion has 
several implications of importance, not the least being for pharmaceutical companies asked whether they
should explore leptins as a pharmaceutical target. At the very least, it suggests that the mouse is not a good
pharmacological model for compounds to be tested for their ability to combat obesity in humans. The
post-genomic analysis suggests that a primate model must be used to test those compounds, with
implications for the cost of developing an anti-obesity drug based on the leptin protein.

Intriguingly, a tree can also be built for the leptin receptor. Here, the evolutionary history is not so 
complete. In particular, fewer primate sequences are available for the leptin receptor than for leptin itself. 
Thus, the reconstructed ancestral sequences are less precise with the leptin receptor family, and the 
assignment of expressed and silent mutations to the tree is less certain. Nevertheless, it appears that the
leptin receptor has undergone an episode of rapid sequence evolution in the primate half of the family as
well. The example illustrates that a large amount of sequence data is needed to build reliable models of this
nature, as the ambiguity in the assignment of ancestral sequences makes it possible that the receptor was
evolving rapidly not only in the lineage leading to primates but also in the lineage leading to mouse.

Nevertheless, the approximate correlation between the episode of rapid sequence evolution in the leptin 
family and in the leptin receptor family suggests a tool that might become useful in the advanced stages of 
post-genomic science when evolutionary histories are very well articulated. Here, it might be possible to 
detect ligand-receptor relationships between protein families in the database by a correspondence between
their episodes of rapid sequence evolution. Thus, ligand families should evolve rapidly (in a non- 
Markovian fashion) at the same time in geological history as their receptors evolve. It will be interesting to 
identify more sequences for primate leptin receptors to see if a more complete evolutionary history allows 
us to see more clearly the co-evolution of the leptin receptor and leptin itself. 

Example 5 Alcohol dehydrogenase 

Mammalian alcohol dehydrogenases (E.C. 1.1.1.1) have undergone a rapid episode of sequence 
evolution in and around the active site as substrate specificity has divergently evolved to handle xenobiotic 
substances in the liver. In contrast, over a comparable span of evolutionary distance, the active site of yeast 
alcohol dehydrogenase has changed very little, corresponding to an apparently constant role of the enzyme, 
to act on the ethanol-acetaldehyde redox couple. Indeed, identifying positions in mammalian 
dehydrogenases where amino acid variation was observed over a span of evolution where the same 
residues were conserved in the yeast dehydrogenases provided a clear map of the active site of the protein. 
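
The comparison described in this example can be sketched as a column-by-column scan of two alignments. The sketch below is a simplification: it assumes the two families have been aligned to a shared set of column coordinates and that the sequences in each list span comparable evolutionary distances, and it calls a column variable whenever more than one residue occurs in it. The function names and the toy sequences are hypothetical.

```python
def column_is_variable(alignment, col):
    """True if more than one distinct residue (ignoring gaps) occurs in the column."""
    residues = {seq[col] for seq in alignment if seq[col] != "-"}
    return len(residues) > 1


def candidate_active_site_positions(variable_family, conserved_family):
    """Return 0-based columns that vary in the first family but not in the second.

    Both alignments must share the same column coordinates.
    """
    length = len(variable_family[0])
    for seq in list(variable_family) + list(conserved_family):
        if len(seq) != length:
            raise ValueError("all aligned sequences must have the same length")
    return [col for col in range(length)
            if column_is_variable(variable_family, col)
            and not column_is_variable(conserved_family, col)]


# Toy alignments, not real dehydrogenase sequences.
mammalian = ["ACDEF", "ACNEF", "ACQEF"]
yeast = ["ACDEF", "ACDEF", "GCDEF"]
print(candidate_active_site_positions(mammalian, yeast))
```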

Example 6 Notch protein 

A set of Notch homologs was obtained and used to build a multiple sequence alignment, an 
evolutionary tree (Drawing 6), and reconstructed intermediates throughout the evolutionary tree. 

The functional interpretation based on these tools proceeded as follows. First, the f2 values showed 
that the silent substitutions were not equilibrated over much of the tree. However, the f2 value becomes 
close to 0.5 at points where the phyla diverge, suggesting near equilibration in the silent values. This 
defines the root of the tree near node 13. Ka/Ks values are given on the branches (numbers in italics). 
They suggest at the level of hypothesis that notch 1, notch 3 and notch are proteins with derived functions, 
while notch 4 is the paralog in mammals with the ancestral function. The rate constant for silent 
substitution is calculated to be ca. 23 x 10'^ changes/base per year. This suggests that the notch paralogs 
diverged ca. 400 MYA. This is at the time of the development of advanced organs in vertebrates, 
suggesting that the Notch paralogs with derived function in the vertebrates are important for this level of 
organogenesis. 
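
The step from an f2 value and a silent-substitution rate constant to a divergence date can be illustrated as follows. This passage does not restate the kinetic model, so the sketch assumes a simple first-order approach to equilibrium, f2 = f_eq + (1 - f_eq) exp(-k t), with the equilibrium value f_eq taken as 0.5 as in the text. The function name is hypothetical, and k is left as a parameter to be supplied by the user; it stands for the effective rate constant for whatever comparison the f2 value was measured on.

```python
import math


def time_since_divergence(f2, k, f_eq=0.5):
    """Solve f2 = f_eq + (1 - f_eq) * exp(-k * t) for t.

    f2   : observed fraction of conserved silent sites (assumed definition)
    k    : first-order rate constant, in changes per base per year
    f_eq : equilibrium value of f2 (0.5 in the text)
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not (f_eq < f2 <= 1.0):
        # At or below f_eq the silent sites are equilibrated and carry no date.
        raise ValueError("f2 must lie in the interval (f_eq, 1]")
    return -math.log((f2 - f_eq) / (1.0 - f_eq)) / k


# Illustrative numbers only: a hypothetical rate constant and f2 value.
print(time_since_divergence(f2=0.8, k=1e-9))
```

The error raised near f2 = 0.5 mirrors the observation above that silent sites approach equilibration where the phyla diverge, so that dates in that region are poorly constrained.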
Example 7. C. elegans paralogs 

NED distances are especially useful when comparing paralogs. Here, we need not worry so much 
about codon bias (it has at least been uniform among paralogs at any instant in evolutionary history). For 
example, we used the Master Catalog to identify all families of paralogs in the genome of C. elegans. Ca. 
1250 families of paralogs with four or more members were found. We separated the families into various 
classes using NED dates. 

(a) Families where duplications all occurred > 400 MYA 

(b) Families where duplications all occurred < 100 MYA 

(c) Families where duplications have been ongoing throughout the past 400 MY. 

(d) Families with duplications in specific episodes. 


(e) Families showing a history of duplication > 400 MYA, but also having more recent episodes of 
recruitment. 
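
A classification of this kind can be sketched from the list of NED dates of the duplication nodes in a family. The rules below are one possible reading of classes (a) to (e), not the procedure actually used: the cutoffs of 100 and 400 MYA come from the text, but the 100 MY bin width, the test for "ongoing" duplication (every bin occupied), and the order in which the rules are applied are assumptions, as is the function name.

```python
def classify_family(dates_mya, old=400.0, recent=100.0, bin_width=100.0):
    """Assign a family to class 'a'..'e' from its duplication dates (in MYA)."""
    if not dates_mya:
        raise ValueError("at least one duplication date is required")
    if all(d > old for d in dates_mya):
        return "a"  # all duplications older than 400 MYA
    if all(d < recent for d in dates_mya):
        return "b"  # all duplications younger than 100 MYA
    if any(d > old for d in dates_mya):
        return "e"  # ancient history plus more recent recruitment
    # Remaining families have every duplication within the past 400 MY.
    n_bins = int(old // bin_width)
    occupied = {min(int(d // bin_width), n_bins - 1) for d in dates_mya}
    if len(occupied) == n_bins:
        return "c"  # duplications in every interval: ongoing
    return "d"      # duplications confined to specific episodes


print(classify_family([50, 150, 250, 350]))
```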



Table 2 presents data from just five of these 1250 families. 



                 Number of nodes generating paralogs in indicated time (MYA) 

Family           0-100   100-200   200-300   300-400   >400    Annotation 

gprod_19987      39                                            similar to reverse transcriptase 
                                                               Histone H2A 
                 5       2         0         0         2       No definition line 
gprod_19811      5       2         3         5         39      Serine-threonine kinase 
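
Counts of the kind shown in Table 2 can be produced by binning the dates of the duplication nodes of a family into the intervals used as column headings. The following is a minimal sketch; the function name and the sample dates are invented, and placing a date of exactly 400 MYA in the open-ended bin is an arbitrary choice.

```python
def count_nodes_by_interval(dates_mya, edges=(0, 100, 200, 300, 400)):
    """Count duplication nodes per interval; dates at or beyond the last edge
    fall into the open-ended final bin."""
    bins = list(zip(edges, edges[1:]))
    last_label = f">{edges[-1]}"
    counts = {f"{lo}-{hi}": 0 for lo, hi in bins}
    counts[last_label] = 0
    for d in dates_mya:
        if d < edges[0]:
            raise ValueError("dates must not be negative")
        for lo, hi in bins:
            if lo <= d < hi:
                counts[f"{lo}-{hi}"] += 1
                break
        else:
            counts[last_label] += 1
    return counts


# Made-up node dates in MYA.
print(count_nodes_by_interval([10, 90, 150, 450, 500]))
```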













If the reviewer is a biomedical scientist, the Table immediately suggests ideas. Consider the family 
annotated as a serine-threonine kinase. It has 145 members in the Master Catalog; 55 of these are from 
C. elegans. The kinases generated by the recent duplications cannot be part of the basic developmental plan 
of C. elegans; this was established 500 MYA. This raises a question: What is it about the serine-threonine 
kinases that recently diverged that might have something to do with recently evolved physiology? We then 
examine the Ka/Ks values within the Master Catalog trees, all with a click of a mouse button. We 
hypothesize which descendants of recent duplications are performing the derived function, and which perform 
the primitive function. Dating the divergence, we try to make statements about changes in nematode 
biology that might be associated with the duplication. These hypotheses can now be tested by experiment 
(knock-outs, in particular). 



family          gprod_1025, gprod_1063, gprod_1069, gprod_10729, gprod_1090 

A  0-0.5        0, 0, 0 
B  0.5-         0, 0, 0 
C  1.0-1.5 
D  1.5-2.0      0, 0, 0, 0, 0 
E  2.0-2.5      5, 2, 2, 5, 3 
F  sum          5, 3, 3, 6, 3 

average #char   143.4, 46, 3, 0, 0, 35.3333 

One observation apparent from the Table is that genes that have multiple recent recruitments in C. 
elegans are unlikely to have clearly identifiable homologs in other phyla, while those that have few recent 
recruitments are more likely than average to have clearly identifiable homologs in other phyla. 